Applying Visual Analysis Procedures to Multidimensional Medical Data

A.E. Bondarev 1, V.A. Galaktionov 1
bond@keldysh.ru | vlgal@gin.keldysh.ru
1 Keldysh Institute of Applied Mathematics RAS, 125047 Miusskaya sq. 4, Moscow, Russia

The paper considers the tasks of visual analysis of multidimensional data sets of medical origin. For visual analysis, the approach of building elastic maps is used. Elastic maps serve as a method of mapping the original data points to nested manifolds of lower dimension. By decreasing the elasticity parameters, one can construct a map surface that approximates the multidimensional data set in question much better. To improve the results, a number of previously developed procedures are used: preliminary data filtering and removal of separated clusters (flotation). To solve the scalability problem, which arises when the elastic map adjusts both to the region where data points are concentrated and to separately located points of the data cloud, the quasi-Zoom approach is applied. Illustrations of applying elastic maps to various sets of medical data are presented.

Keywords: multidimensional data, visual analysis, elastic maps, quasi-Zoom.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Classification occupies a special place in the analysis of multidimensional data. When solving classification problems, the approaches of visual analytics are very useful. They combine several algorithms for reducing the dimension and for visually presenting multidimensional data in manifolds of lower dimension nested in the original volume. These algorithms include mapping the original multidimensional volume to elastic maps [8, 9, 18] with different elasticity properties. Such methods give insight into the cluster structure contained in the multidimensional data volume under study.

Our team became interested in elastic maps while implementing a project to develop computational technologies for building, processing, analyzing and visualizing multidimensional parametric solutions of CFD problems. The computational technology is implemented as a single pipeline of algorithms for the production, processing, visualization and analysis of multidimensional data. Such a pipeline can be considered a prototype of a generalized computational experiment for non-stationary problems of computational gas dynamics. A generalized computational experiment makes it possible to obtain a solution not for a single individual problem, but for a whole class of problems defined by the ranges of variation of the determining parameters. The approach is also universal: it can be applied to a wide range of problems of mathematical modeling of non-stationary processes. The elements of the implemented computing technology are described in [5, 6].

In practice, elastic maps turned out to be a useful and quite versatile tool that can be applied to multidimensional data volumes of various types. The approach was applied to the analysis of textual information, where word usage frequencies [1] served as numerical characteristics, and to the analysis of mineral samples [11]. In the course of this work, a number of procedures for processing the studied data were developed and tested, which improved the results of visual analysis. These procedures include preliminary filtering of data, which weeds out points with indistinctly defined values; removal of separated clusters (flotation); and quasi-Zoom. The latter procedure is designed to solve the scalability problem, which arises when the elastic map adapts both to the region where data points are concentrated and to separately located points of the data cloud, and this complicates visual analysis. The essence of quasi-Zoom is that, for finer adjustment, large clusters are selected in the studied volume of multidimensional data and elastic maps are built for the selected clusters separately, producing an effect similar to the zoom function in modern phototechnics. The results of applying these procedures to multidimensional data volumes of various origins are presented in [1-4].

This approach is generally universal, since it does not depend on the nature of the studied multidimensional data. It is therefore possible to apply the approach and the developed procedures to the study of multidimensional medical data. This paper presents the results of applying the construction of elastic maps and the previously developed procedures to the visual analysis of multidimensional data volumes of medical origin.

In most of the previous cases, we considered data sets that were specially prepared in advance. Here, for the first time, we took several publicly available medical data sets [16]. Some results were previously presented in [3].

2. Elastic maps approach

The ideology and algorithms for the construction of elastic maps are described in detail in [8, 9, 18]. An elastic map is a system of elastic springs embedded in a multidimensional data space. The approach is based on an analogy with problems of mechanics: the principal manifold passing through the "middle" of the data can be represented as an elastic membrane or plate. The method of elastic maps is formulated as an optimization problem: a given functional of the relative location of the map and the data is minimized.

According to [18], the basis for constructing an elastic map is a two-dimensional rectangular grid G embedded in a multidimensional space. The grid approximates the data and has adjustable elastic properties with respect to stretching and bending. The location of the grid nodes is found by minimizing the functional

$$D = \frac{D_1}{|X|} + \lambda \frac{D_2}{m} + \mu \frac{D_3}{m} \to \min,$$

where $|X|$ is the number of points in the multidimensional data volume $X$; $m$ is the number of grid nodes; and $\lambda$ and $\mu$ are the elastic coefficients responsible for the stretching and curvature of the grid. The terms $D_1$, $D_2$, $D_3$ describe the properties of the grid: $D_1$ is a measure of the proximity of the grid nodes to the data, $D_2$ is a measure of the stretching of the grid, and $D_3$ is a measure of the curvature of the grid.

The author of the approach [18] has developed the software package [17], which allows the construction and visual presentation of elastic maps. The main functional features of this software are described in detail in [18]. The figures in this article were created with this software package.

3. Procedures for visual analysis

A number of procedures for processing multidimensional data were developed previously and improved the results of visual analysis. These procedures include preliminary filtering of data, which weeds out points with indistinctly defined values; removal of separated clusters (flotation); and quasi-Zoom. Below we briefly give examples of applying these procedures to multidimensional data volumes of different origin.
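The three procedures named above can be illustrated with a minimal sketch. It is not the authors' implementation: the elastic map itself is replaced here by a plain projection onto the first principal components, the dense core for quasi-Zoom is chosen by distance to the coordinate-wise median, and the function names, the `quantile` parameter and the synthetic data are all assumptions made for illustration only.

```python
import numpy as np


def filter_points(X):
    """Pre-filtering: drop points that have undefined (NaN/inf) values."""
    keep = np.all(np.isfinite(X), axis=1)
    return X[keep], keep


def flotation(X, labels, removed_label):
    """Flotation: remove a cluster that is already clearly separated."""
    keep = labels != removed_label
    return X[keep], labels[keep]


def quasi_zoom(X, quantile=0.9):
    """Quasi-Zoom: keep only the dense core so a map can be rebuilt on it.

    Assumption of this sketch: the core is the `quantile` share of points
    closest to the coordinate-wise median of the cloud.
    """
    dist = np.linalg.norm(X - np.median(X, axis=0), axis=1)
    keep = dist <= np.quantile(dist, quantile)
    return X[keep], keep


def pca_project(X, n_components=2):
    """Stand-in for the map: coordinates in the first principal components."""
    centered = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Synthetic cloud (hypothetical data): a dense region plus distant points.
rng = np.random.default_rng(0)
dense = rng.normal(0.0, 1.0, size=(90, 5))
distant = rng.normal(25.0, 1.0, size=(10, 5))
X = np.vstack([dense, distant])
labels = np.array([0] * 90 + [1] * 10)

X_clean, _ = filter_points(X)                 # nothing is dropped here
X_zoom, kept = quasi_zoom(X_clean, quantile=0.9)

full_view = pca_project(X_clean)              # scale dominated by distant points
zoom_view = pca_project(X_zoom)               # dense region fills the picture
print(full_view[:, 0].std(), zoom_view[:, 0].std())
```

In this toy setting the first view spends most of its scale on the gap between the dense region and the distant points, while the second view is rebuilt on the dense region alone, which is the effect quasi-Zoom is meant to achieve. When class labels are available, `flotation(X, labels, removed_label=1)` removes the separated group instead.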
An example of constructing elastic maps for multidimensional data describing mineral resources, namely three types of coal from Polish deposits [11], is given in [4]. The data are points in a multidimensional feature space (characteristics of coal samples) and cover three grades of coal. The task was to classify coal by grade. By combining the construction of elastic maps with the removal of fuzzy points and of separated classes (filtering and flotation of data), it is possible to completely separate the samples of the initial volume into three classes corresponding to the three types of coal.

Examples of using quasi-Zoom to analyze the thematic proximity of words of the Russian language are given in [1, 2, 4]. The method is based on the analysis of the environment of words. The main hypothesis is that similar words should occur in approximately the same context. In the space of attributes such words will therefore lie relatively close to each other, while dissimilar words will lie farther apart. Texts from news sources (news feeds for a certain period) were used as test data. For the primary tests, about 100 verbs with 353 associated nouns were selected. The resulting data were treated as a multidimensional data volume of 100 points in 353-dimensional space. The numerical values of the resulting matrix are the frequencies of joint use. The data volume contained a region of high data density and points far from this region. The practical task in studying the frequency of joint use of verbs and nouns was to separate the "stuck together" points. Filtering followed by two consecutive quasi-Zoom procedures solved this problem completely (Fig. 1).

Fig. 1. Extension of the elastic map after two consecutive quasi-Zoom applications.

Applying a similar approach to the transposed data file allowed us to identify a number of semantic clusters among the set of nouns (Fig. 2). This opens up additional opportunities for specialists in this field to analyze and interpret semantic groups.

Fig. 2. Extension of the elastic map for the transposed data set after applying quasi-Zoom.

The construction of elastic maps was also applied to the study of multidimensional arrays of errors of different solvers relative to a reference solution [4]. We considered numerical results comparing the accuracy of various solvers of the OpenFOAM software package on the well-known problem of inviscid flow around a cone at zero angle of attack. The results obtained with the OpenFOAM solvers were compared with the well-known numerical solution of this problem while varying the free-stream Mach number and the cone angle. Four OpenFOAM solvers participated in the comparison: rhoCentralFoam, pisoCentralFoam, sonicFoam and rhoPimpleFoam. These solvers have different approximation and computational properties. Figure 3 shows the elastic map for pressure, obtained from the parametric calculations, in the space of the first principal components. The yellow circles show the results for the rhoCentralFoam solver, the red ones for pisoCentralFoam, the green ones for sonicFoam and the blue ones for rhoPimpleFoam.

Fig. 3. Elastic map for the array of errors for different OpenFOAM solvers.

The visual analysis showed that the errors for rhoCentralFoam and for pisoCentralFoam can be roughly approximated by a plane reflecting the dependence of the error on the Mach number and the cone angle.

4. Processing of medical datasets

An attempt to apply elastic maps to medical data was made in [2], using the data from [13]. This data set contains values of six biomechanical features used to classify orthopaedic patients into 2 classes (normal or abnormal). Each patient is represented by six biomechanical attributes derived from the shape and orientation of the pelvis and lumbar spine (in this order): pelvic incidence, pelvic tilt, lumbar lordosis angle, sacral slope, pelvic radius and grade of spondylolisthesis. The data set contains 310 points in 6-dimensional space. Unfortunately, elastic maps did not give good results from the point of view of classification.

Below are the results for three other multidimensional data volumes that involve a classification problem. All data sets were taken from the UCI Machine Learning Repository [16].

The first data set concerns the variability of impedivity in normal and pathological breast tissue [10] and the classification of various types of disease [14]. It contains 106 points in a 9-dimensional attribute space. Each point also has a class attribute corresponding to the tissue type: carcinoma, fibro-adenoma, mastopathy, glandular, connective, adipose. According to [14], the data set can be used to predict either the original 6 classes or 4 classes obtained by merging the fibro-adenoma, mastopathy and glandular classes, whose discrimination is not important (they cannot be accurately discriminated anyway).

In what follows we use this notation and color scheme for the classes: car (carcinoma) - red, adi (adipose) - yellow, con (connective) - green, fad+ (fibro-adenoma + mastopathy + glandular) - blue. We use the combined fad+ class because of the above remark by the authors of the data set that these classes cannot be separated exactly.

The construction of elastic maps for this data volume is illustrated below. Figure 4 shows the source data in the space of the first three principal components. Figures 5 and 6 show the elastic map and its extension for this data volume.

Fig. 4. Source data in the space of the first principal components.

Fig. 5. Elastic map for source data.

Fig. 6. Extension of elastic map for source data.

The figures show that the pairs of classes (car + fad+) and (con + adi) are well separated from each other. Within each pair, however, data from different classes are mixed. To improve the separation, we apply flotation and remove fad+. The elastic map built for this case is shown in Figure 7. Here the car class is fully distinguished.

Fig. 7. Extension of elastic map for classes car, con, adi.

Now we remove the car class and consider the remaining pair of classes, con and adi, separately. Constructing the elastic map and its extension gives the picture shown in Figure 8. In this case, a satisfactory separation of classes is achieved.

Fig. 8. Extension of elastic map for classes con, adi.

Next, we consider the pair of classes car and fad+ together. Figure 9 presents the extension of the elastic map for these classes. The separation is again satisfactory. Using quasi-Zoom to improve the separation in the center of the picture did not succeed. An attempt to divide the mixed fad+ class into the fad, mas and gla classes was also unsuccessful. The comment in [14] about the inseparability of these classes turned out to be true.

Fig. 9. Extension of elastic map for classes car, fad+.

The second data set is also devoted to predicting breast disease [7, 12]. It contains 116 points in a 10-dimensional attribute space. Each point also carries a binary variable indicating the presence or absence of the disease. The attribute space contains ten predictors. According to [12], the predictors are anthropometric data and parameters that can be gathered in routine blood analysis. Prediction models based on these predictors, if accurate, can potentially be used as a biomarker of breast cancer.

An elastic map was constructed for this data volume. Points without the disease are shown in green, and points with the disease in red. Figures 10 and 11 show the constructed elastic map and its extension. As one can see, the green and red dots are strongly mixed. This caused some confusion, since by construction the picture shows points that must be close to each other in the multidimensional attribute space.

Fig. 10. Elastic map for 10-dimensional attribute space.

Fig. 11. Extension of elastic map for 10-dimensional attribute space.

However, the original article [12] contains a picture from which one can conclude that only 4 parameters (glucose, insulin, resistin, and HOMA, the homeostasis model assessment) differ significantly between patients and healthy people. Only these 4 dimensions were kept in the data space, and the elastic map was constructed again. The results are shown in Fig. 12. The separation between the green and red dots has improved significantly; however, there is still an area in the center of the picture where the dots are mixed.

Fig. 12. Extension of elastic map for 4-dimensional attribute space.

The third data set is intended for the early diagnosis of Autistic Spectrum Disorder (ASD) [15]. It consists of 692 points originally defined in a 21-dimensional attribute space. The diagnostic approach is based on the analysis of a questionnaire consisting of 10 questions. About half of the attributes are patient data. It was therefore decided to keep 12 attributes: the 10 answers to the questionnaire, the age of the patient and the total questionnaire score. The results are presented in Figures 13 and 14 as an elastic map and its extension.

Fig. 13. Elastic map for 12-dimensional attribute space when diagnosing ASD.

Fig. 14. Extension of elastic map for 12-dimensional attribute space when diagnosing ASD.

These results show that the separation between the diagnoses of presence and absence of ASD is quite satisfactory on the studied data set.

5. Conclusions

To analyze structures in multidimensional data volumes, we use technologies for constructing elastic maps, which map points of the original multidimensional space to nested manifolds of lower dimension. A number of data processing techniques that can improve the results are considered: pre-filtering of data, removal of separated clusters (flotation), and quasi-Zoom. Examples of constructing elastic maps and of using these procedures for multidimensional data of medical origin are given. The results show that the construction of elastic maps, together with the accompanying data processing procedures, can serve as a useful tool for visual data analysis and can complement other methods for studying multidimensional data volumes.

However, the results also show that processing medical data from open sources raises a new problem. The data considered are clearly overloaded with unnecessary measurements and unnecessary information. This makes the data "noisy" and hinders the separation of classes. To overcome this problem, we plan to implement an additional procedure that analyzes the contribution of each measurement to the total variance and then removes unnecessary criteria.

6. References

[1] Bondarev, A.E. et al., 2016. Visual analysis of clusters for a multidimensional textual dataset. Scientific Visualization, 8(3), 1-24.
[2] Bondarev, A.E., 2017. Visual analysis and processing of clusters structures in multidimensional datasets. ISPRS Archives, XLII-2/W4, 151-154.
[3] Bondarev, A.E., 2019. The procedures of visual analysis for multidimensional data volumes. Int. Arch. Photogramm. Remote Sens. Spatial Inf. Sci., XLII-2/W12, 17-21, doi.org/10.5194/isprs-archives-XLII-2-W12-17-2019.
[4] Bondarev, A.E., Bondarenko, A.V., Galaktionov, V.A., 2018. Visual analysis procedures for multidimensional data. Scientific Visualization, 10(4), 109-122, doi.org/10.26583/sv.10.4.09.
[5] Bondarev, A.E., Galaktionov, V.A., 2015a. Analysis of Space-Time Structures Appearance for Non-Stationary CFD Problems. Procedia Computer Science, 51, 1801-1810.
[6] Bondarev, A.E., Galaktionov, V.A., 2015b. Multidimensional data analysis and visualization for time-dependent CFD problems. Programming and Computer Software, 41(5), 247-252, doi.org/10.1134/S0361768815050023.
[7] Crisóstomo, J. et al., 2016. Hyperresistinemia and metabolic dysregulation: a risky crosstalk in obese breast cancer. Endocrine, 53(2), 433-442, doi.org/10.1007/s12020-016-0893-x.
[8] Gorban, A. et al., 2007. Principal Manifolds for Data Visualisation and Dimension Reduction. Springer, Berlin, Heidelberg, New York.
[9] Gorban, A., Zinovyev, A., 2010. Principal manifolds and graphs in practice: from molecular biology to dynamical systems. International Journal of Neural Systems, 20(3), 219-232.
[10] Jossinet, J., 1996. Variability of impedivity in normal and pathological breast tissue. Med. & Biol. Eng. & Comput., 34, 346-350.
[11] Niedoba, T., 2014. Multi-parameter data visualization by means of principal component analysis (PCA) in qualitative evaluation of various coal types. Physicochemical Problems of Mineral Processing, 50(2), 575-589.
[12] Patrício, M. et al., 2018. Using Resistin, glucose, age and BMI to predict the presence of breast cancer. BMC Cancer, 18(1), doi.org/10.1186/s12885-017-3877-1.
[13] Rocha Neto, A., Barreto, G., 2009. On the Application of Ensembles of Classifiers to the Diagnosis of Pathologies of the Vertebral Column: A Comparative Analysis. IEEE Latin America Transactions, 7(4), 487-496.
[14] Silva, J.E., Marques de Sá, J.P., Jossinet, J., 2000. Classification of Breast Tissue by Electrical Impedance Spectroscopy. Med. & Biol. Eng. & Comput., 38, 26-30.
[15] Thabtah, F., 2017. Machine learning in autistic spectrum disorder behavioral research: A review and ways forward. Informatics for Health and Social Care, doi.org/10.1080/17538157.2017.1399132.
[16] UCI Machine Learning Repository, 2019. archive.ics.uci.edu/ml/ (01 March 2019).
[17] ViDaExpert, 2019. bioinfo.curie.fr/projects/vidaexpert (01 March 2019).
[18] Zinovyev, A., 2000. Vizualizacija mnogomernyh dannyh [Visualization of multidimensional data]. Krasnoyarsk, publ. NGTU, 180 p. [In Russian].