Proceedings of the VIII International Conference "Distributed Computing and Grid-technologies in Science and Education" (GRID 2018), Dubna, Moscow region, Russia, September 10 - 14, 2018

DATA CONSOLIDATION AND ANALYSIS SYSTEM FOR BRAIN RESEARCH

V.I. Volosnikov 1, a, V.V. Korkhov 1, b, A.O. Vorontsov 1, K.V. Gribkov 1, A.B. Degtyarev 1, A.V. Bogdanov 1, N.M. Zalutskaya 2, N.G. Neznanov 2, N.I. Ananyeva 2

1 Saint Petersburg State University, 7/9 Universitetskaya nab., St. Petersburg, 199034, Russia
2 V.M. Bekhterev Psychoneurological Research Institute, St. Petersburg, Russia

E-mail: a Volosnikov.apmath@gmail.com, b v.korkhov@spbu.ru

Comprehensive studies in the field of brain pathology require strong information support for consolidating data from different sources. The heterogeneity of data sources and the resource-intensive nature of preprocessing make it difficult to conduct comprehensive interdisciplinary research. To solve this problem for brain studies, an information system with unified access to heterogeneous data is required. Effective implementation of such a system requires adapting preprocessing methods and creating a model for combining disparate data into a single information environment. We analyze the possibilities and methods of consolidating clinical and biological data, build a model for the consolidation and interaction of heterogeneous data sources for brain research, implement the model as a cloud service, and provide a data interface that hides the complexity of the architecture from the user. We present the design and implementation of the information system, and we show and discuss the results of applying cluster analysis methods to differentiate various types of dementia using MRI data.
Our results show that studying the properties of cluster analysis data can significantly help neurophysiologists in the study of cognitive disorders such as Alzheimer’s disease, especially with the capabilities provided by the proposed information system.

Keywords: brain, data analysis, data consolidation, cluster analysis, information system, neuroinformatics, Alzheimer’s disease, cloud computing, service-oriented architecture

© 2018 Vladislav I. Volosnikov, Vladimir V. Korkhov, Andrey O. Vorontsov, Kirill V. Gribkov, Alexander B. Degtyarev, Alexander V. Bogdanov, Natalia M. Zalutskaya, Nikolay G. Neznanov, Natalia I. Ananyeva

1. The structure of the storage and processing system

Specialists use a wide range of analyses and measurements in research on the human brain. Because the development of diseases and disorders is extremely complex, heterogeneous indicators such as the results of functional diagnostics, psychological tests, blood tests and DNA data should be considered together rather than in isolation. While some data are presented in a relatively simple numerical form, a number of measurements have a complex structure, e.g. MRI and fMRI data, which contain information about how the brain functions. Such data require substantial computational power for processing and analysis and large amounts of memory for storage. At the current stage of research at the V.M. Bekhterev Psychoneurological Institute, medical examination results already occupy more than 20 TB, and preprocessing new results takes approximately 12 hours on a fairly powerful computer.

Neuroinformatics tasks are focused on the creation, storage, processing, simulation and visualization of research results.
All of these stages involve working with large amounts of data and require special software to run efficiently. For these reasons, a cloud approach is necessary to optimize and expand the research, as shown in our previous article, written as part of a joint project of the V.M. Bekhterev Psychoneurological Institute and St. Petersburg State University [1].

To ensure the effectiveness of research, a cloud system for analyzing and storing data, based on the computing resources of St. Petersburg State University and the Bekhterev Institute, is being developed and integrated into the practical work of medical researchers. The system consists of a number of separate services, as shown in Figure 1. Through the use of containerization tools such as Docker [2], the system model is flexible and easy to change, which is extremely important for building a virtual data center [3]. In addition, containerization makes it possible to adopt a Continuous Integration approach, reducing development and deployment costs.

Figure 1. A schema of the data storage and processing system

The use of a service-oriented architecture (SOA) makes it possible to solve a number of problems arising during the creation of the system. One of the requirements is compliance with personal data legislation: the research results contain information about patients, which must not leave the Bekhterev Institute. Employee data used for authorization in the system should also be stored only at the Institute. Another important advantage of SOA is the ease of scaling and of adding new resources. As a result, the system model can be integrated into existing collaborations on the study of the human brain.

2. Data preprocessing and consolidation

As mentioned above, the data under study are extremely heterogeneous. Because the values being processed differ in structure and nature, each type of data requires its own tools and methods for preprocessing. Preprocessing of certain types of data is very costly in time and computational resources: raw MRI results are processed with the FreeSurfer package [4], which takes about 12 hours per record and needs a large amount of RAM. In other cases, the data are difficult to formalize and interpret. Some types, such as EEG, are stored in specific formats that require a special approach [5].

Moving data preparation to the cloud system speeds it up significantly, because computing resources with the optimal configuration can be used for each specific case. Furthermore, distributed storage and processing systems make it possible to consolidate and analyze the data as a whole, which is a requirement for successful research in this subject area.

3. Specialist working environment

The service-oriented architecture lets us freely select technology stacks for the different loosely coupled parts of the system. Alongside distributed databases and data analysis tools from the Python ecosystem for analyzing and storing heterogeneous data, we use the MEAN (Mongo, Angular, Express, Node) stack to implement the user interface and work environment. The validity of this decision was demonstrated in a previous paper [1]. The primary task of the interface is to make working with the system intuitive, adapting it to the requests and tasks of a specialist in the medical field.
Based on how the work of a specialist is organized, the main object in this subsystem is the patient page, which provides functions for adding the results of various analyses and studies and for monitoring the completeness and correctness of the information (Figure 2). Experience with such systems shows that automatic entry of functional diagnostics results into the database, together with checking that they are matched to the correct patient, is a necessary condition for preserving the integrity and relevance of the data. Therefore, developing a working environment that meets the requirements set by specialists is one of the priorities.

Figure 2. A user interface example

4. Application of cluster analysis methods

An important task of the developed system is comprehensive analysis of the data and the search for correlations in them. To this end, the architecture supports a wide range of analysis methods: a single Data API provides transparent access to the consolidated data, which avoids various difficulties that arise with different approaches to analysis. One example of the implemented methods is a cluster analysis toolkit, whose potential for assisting a specialist in making a diagnosis was shown in previous papers on this topic [1, 6]. It should be noted that structural and functional brain changes occur long before the obvious manifestations of cognitive impairment. For this reason, methods of automatic neuroimaging and analysis are very useful in medical practice. The implemented subsystem makes it possible to carry out cluster analysis on various brain lobes and their combinations.
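As an illustration of this kind of cluster analysis, the following is a minimal sketch of a density-based method. It assumes DBSCAN, although the paper names only the class of methods and not a specific algorithm. The points are synthetic two-feature vectors standing in for per-lobe measurements; the helper `grid` and the values of `eps` and `min_pts` are hypothetical choices made for the example, not parameters from the study. A density-based method needs no cluster count in advance and marks isolated points as noise.

```python
import math


def region_query(points, i, eps):
    """Return indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:  # core point: keep growing the cluster
                queue.extend(j_neighbours)
    return labels


def grid(x0, y0, n=4, step=0.1):
    """Synthetic n x n block of points starting at (x0, y0) (illustrative data only)."""
    return [(x0 + a * step, y0 + b * step) for a in range(n) for b in range(n)]


# Two dense synthetic groups plus one isolated point (assumed data, not study results).
points = grid(0.0, 0.0) + grid(5.0, 5.0) + [(2.5, 10.0)]
labels = dbscan(points, eps=0.15, min_pts=4)
print(sorted(set(labels)))
```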
A convenient visual presentation, together with automatic statistical analysis, simplifies the search for optimal parameters for partitions of the required significance. Because the number and shapes of the clusters are not known in advance, density-based algorithms are the main class of methods used. Existing articles on this topic also note the high efficiency of other approaches, namely random SVM and deep learning methods [7, 8]. Thanks to the flexibility of the architecture, these methods will also be implemented.

At this stage of development, the purpose of cluster analysis was to clearly separate the control group from patients with pronounced signs of cognitive impairment, such as Alzheimer’s disease and other types of dementia. As can be seen in the graphs obtained from the data processing (Figure 3), the results are consistent with expectations at a sufficient level of statistical significance. Obvious neurodegenerative changes could be separated from conditionally healthy volunteers even without taking into account already known regions of interest (ROI) and patterns. There is reason to believe that working together with experts in neurophysiology to apply a number of well-known rules and patterns can lead to a significant improvement in results and to the introduction of the tools into medical practice.

Figure 3. A result of temporal lobe analysis

5. Conclusions and future work

The developed system is necessary for successful research on the human brain. Given the huge arrays of heterogeneous information that require complex analysis, beyond a certain point work with standard tools becomes impossible: the tasks require fundamentally different approaches of the kind used for processing Big Data.
These changes cease to be quantitative and become qualitative. The consequence is a transition to distributed cloud computing, with the prospect of integration into existing scientific collaborations. In addition, machine analysis methods are unavoidable for research of such high complexity. This is confirmed by the successes achieved with cluster analysis, one of the tools that can simplify the work of a specialist and minimize mistakes. The constructed platform opens up broad opportunities for the further development of the project. We plan to expand the scope of the data used, implement a number of highly efficient analysis methods, develop a decision support system, and optimize and expand the functionality of the specialist’s work environment.

6. Acknowledgement

The work on the data consolidation and analysis system was supported by Saint Petersburg State University grant no. 26520170 and by the Russian Foundation for Basic Research (RFBR), grant #16-07-00886.

References

[1] V. Korkhov, V. Volosnikov, A. Vorontsov, K. Gribkov, N. Zalutskaya, A. Degtyarev, A. Bogdanov. Data storage, processing and analysis system to support brain research // Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2018, vol. 10963, pp. 78–90, ISBN: 978-331962403-7.
[2] V. Korkhov, I. Gankevich, A. Degtyarev, A. Bogdanov, V. Gaiduchok, N. Ahmed, A. Cubahiro. Experience in building virtual private supercomputer // Proceedings of International Conference on Computer Science and Information Technologies (CSIT), 2015, pp. 220–223, ISBN: 978-5-8080-0797-0.
[3] A. Bogdanov, A. Degtyarev, V. Korkhov. Desktop supercomputer: what can it do? // Physics of Particles and Nuclei Letters, 2017, vol. 14 (7), pp. 985–992, DOI: 10.1134/S1547477117070032.
[4] FreeSurfer // http://surfer.nmr.mgh.harvard.edu/ (accessed: 04 Nov 2018).
[5] WinEEG // http://www.mitsar-medical.com/eeg-software/qeeg-software/ (accessed: 04 Nov 2018).
[6] V.I. Volosnikov. Primenenie metodov klasternogo analiza dlya diagnostirovaniya bolezni Al`czgejmera [The application of cluster analysis methods for diagnosing Alzheimer's disease] // Control Processes and Stability, 2018, vol. 5 (21), pp. 267–271, ISSN: 2313-7304 (in Russian).
[7] Xia-an Bi, Qing Shu, Qi Sun, Qian Xu. Random support vector machine cluster analysis of resting-state fMRI in Alzheimer's disease // PLoS One, 2018, vol. 13 (3), DOI: 10.1371/journal.pone.0194479.
[8] H.I. Suk, D. Shen. Deep Learning-Based Feature Representation for AD/MCI Classification // Med Image Comput Comput Assist Interv., 2013, vol. 16 (pt. 2), pp. 583–590.