Processing of multidimensional data in distributed systems for solving the task of tsunami wave modeling

S. Iu. Sveshnikova

Saint Petersburg State University, 35, University ave., Peterhof, St. Petersburg, 198504, Russia
E-mail: st012967@student.spbu.ru

Many applied research problems in geography and oceanology require big data processing. One of these problems is tsunami wave modeling. This task involves dynamic re-interpolation of bathymetry data on multiple grids of different scales, determined by the distance from the coastline and the presence of islands along the wave front. In this work, re-interpolation is implemented by applying parallel programming primitives to multidimensional data arrays distributed across the nodes of a computer cluster. This makes it possible to work effectively with data that does not fit into the memory of a single compute node and, in addition, improves processing speed compared to a sequential program. NetCDF, the format used to store the bathymetry data, is a hierarchical format for which there are no ready-made solutions for processing in distributed systems. The paper reviews alternative solutions and applies one of them to the given task.

Keywords: big data, distributed computing, geodata, NetCDF, re-interpolation

The work was supported by grants of the Russian Foundation for Basic Research (projects no. 16-07-01111, 16-07-00886, 16-07-01113) and Saint Petersburg State University (project no. 0.37.155.2014).

© 2016 Svetlana Iu. Sveshnikova

Introduction

At present, solutions for working with big data evolve very fast and are used in many subject fields. One of them is the processing of geodata. Many tasks need geodata; in this work the task is tsunami wave modeling. A tsunami is usually caused by an underwater earthquake. For modeling purposes, two main stages of wave propagation are distinguished. The first stage is wave propagation in the open sea. The propagation speed depends on the depth of the ocean and the acceleration of gravity: in the shallow-water approximation, c = √(g·h), where g is the acceleration of gravity and h is the depth. At this stage, computations may be executed on the main (coarse) grid. The second stage considers the behavior of the wave near the shore. The wavelength decreases as the wave approaches the coastline, because the depth becomes smaller; in this regard, it is necessary to reduce the step of the computational grid. Thus, re-interpolation of the computational grid is an important factor for modeling. High-quality modeling makes it possible to assess the scale of the disaster adequately and to take the right actions to protect people and strategically important objects. The speed of computing is also important for this task: the faster a tsunami is detected, the more time the services have for the adoption of protective measures.

Related works

The book [Poplavskii, Khramushin, …, 1997] discusses tsunami questions. The first section reviews the methods used to detect tsunami waves and to measure the level of disaster. The second section describes computing experiments that make it possible to predict and model tsunami waves. The book also contains information about two computer programs by the authors, “Ani” and “Mario”.

SciHadoop [Buck, Watkins, …, 2011] is a plugin for Hadoop that implements query execution logic over geodata using information about its structure. Several solutions make this possible: a file format for working with multidimensional arrays, which defines the key concepts of array shape and corner point; a custom query syntax created for this file format; and, finally, a corrected partitioning process in the Hadoop scheduler.

The paper [Biookaghazadeh, Xu, …, 2015] describes a software solution for native support of NetCDF files by Hadoop.

Problem statement

Tsunami wave modeling requires dynamic re-interpolation of the computational grid, which is obtained from NetCDF files. A program with this functionality already exists and is used in the Far East of Russia [Khramushin, 2010]. It is an old system with rich functionality, but its architectural decisions stop its evolution: the program does not work with files larger than 2 gigabytes and cannot run in distributed mode. Earlier it coped with the processing of the locally available data, but recently a new file with the results of altimetric surveys appeared [Olson, Becker, Sandwell, 2014]; it contains a full Earth grid with 15-arc-second resolution, and its size is 14 gigabytes. It will allow forecasting a tsunami from a more distant signal. The new program should meet the following requirements:

• processing of NetCDF files;
• work in distributed mode;
• processing of big files (14 GB and more).

This paper considers the organization of the interpolation functionality for this program and the problem of processing geodata with programs from the big data stack.

Main content

Fast processing of big data is impossible without distributed computing systems. Apache Spark, a very popular framework for distributed computing, was selected as this tool. The core of Apache Spark is a structure called the RDD (resilient distributed dataset). An RDD contains key-value pairs; transformations and actions can be executed on the whole dataset and run at the same moment on all cluster nodes.

Apache Spark is oriented towards unstructured streaming data, which makes it difficult to work with NetCDF files. Let us consider this in more detail. NetCDF is a scientific format used to describe geodata. A NetCDF file contains two blocks: metadata and data. For geodata, the metadata section defines terms such as dimensions, variables, global attributes and some others. The data block usually contains multidimensional arrays; for example, the depth data is contained in a 3-dimensional array: x-axis, y-axis, and depths. NetCDF uses random access to the file: information can be obtained from the middle of the file without reading the whole file from the beginning. With Apache Spark, on the contrary, a file can only be read entirely from beginning to end, and it is impossible to select the metadata or a single dimension of an array.

Fig. 1. Scheme of NetCDF file
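To make this contrast concrete, the sketch below shows NetCDF random access with the Unidata NetCDF-Java library: it reads one rectangular tile of the bathymetry grid by its corner point and shape, the same two concepts on which SciHadoop builds its array queries. The file name SRTM15_PLUS.nc, the variable name "z" and the tile indices here are illustrative assumptions; a real file should first be inspected, for example with ncdump.

```scala
import ucar.nc2.NetcdfFile

object NetcdfSubset {
  def main(args: Array[String]): Unit = {
    // Opening the file reads only its header; the 14 GB data block stays on disk.
    val nc = NetcdfFile.open("SRTM15_PLUS.nc") // assumed file name
    try {
      // Metadata is available without touching the data block at all.
      nc.getDimensions.forEach(d => println(s"${d.getFullName} = ${d.getLength}"))

      // Random access: read one rectangular tile by its corner point
      // (origin) and extent (shape); the rest of the file is never scanned.
      val z      = nc.findVariable("z")  // assumed variable name for depth
      val origin = Array(4000, 2000)     // corner point: (row, column)
      val shape  = Array(512, 512)       // tile extent
      val tile   = z.read(origin, shape)
      println(s"first depth value in the tile: ${tile.getDouble(0)}")
    } finally nc.close()
  }
}
```

Reading by origin and shape is exactly the operation that a plain Spark input format cannot express, since it streams a file linearly from the first byte.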
The problem of processing hierarchical files in batch and stream processing systems has been worked on for a long time, and some of the solutions were reviewed in the related works. In this work the new SciSpark library is used. SciSpark is a project supported by the Apache Foundation and a NASA laboratory; it works with NetCDF files using the linear algebra libraries Breeze and ND4J [Palamuttam et al., 2015].

The key concept of SciSpark is the sRDD structure, an analog of the RDD oriented towards the processing of scientific data, in particular NetCDF. An sRDD consists of several partitions, each of which contains a structure named sciTensor. The sciTensor is a self-documented array collection developed for sRDD transformations; its goal is to provide the logic that defines the data in the multidimensional format. Each sciTensor contains metadata in key-value form and an AbstractTensor that stores the data. AbstractTensor is a base class for BreezeTensor and ND4JTensor and is used to provide a common interface to the different linear algebra libraries. Breeze is the older of the two projects, but it supports only 2-dimensional arrays; ND4J supports linear algebra operations on multidimensional arrays.

Fig. 2. sRDD structure

The program workflow is as follows. The user defines a pair of coordinates, which forms the corner point of a rectangle, and the step of the new grid. The program cuts out the part of the data selected by these coordinates and runs the interpolation function on it; at the current time the Lagrange interpolation function is used. The result of the calculations may be saved to a NetCDF or text file, or re-used in the next operations.
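As an illustration, a sketch of such a Lagrange re-interpolation of a depth grid onto a finer step is given below. The function names, the local 4-point stencil, and the separable application of the 1-D rule (along x, then along y) are only assumptions of this sketch and may differ from the production code.

```scala
object LagrangeRegrid {

  // 1-D Lagrange interpolation at coordinate x over a local 4-point
  // stencil of a uniform grid with step h (stencil clamped at the edges).
  // Assumes the grid has at least 4 nodes in each direction.
  def lagrange1d(values: Array[Double], h: Double, x: Double): Double = {
    val i0 = math.min(math.max((x / h).toInt - 1, 0), values.length - 4)
    var sum = 0.0
    for (i <- i0 until i0 + 4) {
      var basis = 1.0
      for (j <- i0 until i0 + 4 if j != i)
        basis *= (x - j * h) / ((i - j) * h)
      sum += values(i) * basis
    }
    sum
  }

  // Re-interpolate a depth grid with step h onto a finer step hNew by
  // applying the 1-D rule separably: every row along x, then columns along y.
  def regrid(depth: Array[Array[Double]], h: Double, hNew: Double): Array[Array[Double]] = {
    val ny    = depth.length
    val nx    = depth(0).length
    val nxNew = ((nx - 1) * h / hNew).toInt + 1
    val nyNew = ((ny - 1) * h / hNew).toInt + 1
    // Pass 1: refine every coarse row along x.
    val rows = depth.map(row => Array.tabulate(nxNew)(i => lagrange1d(row, h, i * hNew)))
    // Pass 2: refine every fine column along y.
    Array.tabulate(nyNew, nxNew) { (j, i) =>
      lagrange1d(Array.tabulate(ny)(k => rows(k)(i)), h, j * hNew)
    }
  }
}
```

Each row in the first pass is processed independently of the others, which is what makes the operation easy to distribute: in a Spark version, the map over rows simply becomes a map over an RDD of rows spread across the cluster nodes.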
Conclusion and future work

The paper describes a geodata processing system based on a distributed architecture and big data stack technologies. The integration of NetCDF and Spark was performed using the SciSpark library and was tested on the task of grid re-interpolation. The study may serve as a prototype for a new tsunami forecast system.

References

Poplavskii A. A., Khramushin V. N., Nepop K. I. et al. Operativnyj prognoz navodnenij na morskikh beregakh Dalnego Vostoka Rossii [Operational forecast of flooding on the sea coasts of the Russian Far East]. Iuzhno-Sakhalinsk: FEB RAS, 1997 (in Russian).

Buck J. B., Watkins N., LeFevre J. et al. SciHadoop: Array-based query processing in Hadoop // Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 2011. P. 66.

Biookaghazadeh S., Xu Y., Zhou Sh. et al. Enabling scientific data storage and processing on big-data systems // 2015 IEEE International Conference on Big Data. IEEE, 2015. P. 1978–1984.

Khramushin V. N. Ani software and computing complex, 2010. State registration number 2010615848. URL: http://shipdesign.ru/SoftWare/2010615848.html (in Russian).

Olson C. J., Becker J. J., Sandwell D. T. A new global bathymetry map at 15 arcsecond resolution for resolving seafloor fabric: SRTM15_PLUS // AGU Fall Meeting Abstracts. 2014. Vol. 1. P. 03.

Palamuttam R. et al. SciSpark: Applying in-memory distributed computing to weather event detection and tracking // 2015 IEEE International Conference on Big Data. IEEE, 2015. P. 2020–2026.