Processing of multidimensional data in distributed systems for solving the task of tsunami wave modeling

S. Iu. Sveshnikova

Saint Petersburg State University, 35, University ave., Peterhof, St. Petersburg, 198504, Russia
E-mail: st012967@student.spbu.ru

Many applied research problems in geography and oceanology require big data processing. One of these problems is tsunami wave modeling. This task involves dynamic re-interpolation of bathymetry data on multiple grids of different scales, determined by the distance from the coastline and the presence of islands along the wave front. In this work, re-interpolation is implemented by applying parallel programming primitives to multidimensional data arrays distributed across the nodes of a computer cluster. This makes it possible to work effectively with data that does not fit into the memory of a single compute node and, in addition, improves processing speed compared to a sequential program. NetCDF, the format used to store the bathymetry data, is a hierarchical format for which there are no ready-made solutions for processing in distributed systems. The paper reviews alternative solutions and applies one of them to the given task.

Keywords: big data, distributed computing, geodata, NetCDF, re-interpolation

The work was supported by grants of the Russian Foundation for Basic Research (projects no. 16-07-01111, 16-07-00886, 16-07-01113) and Saint Petersburg State University (project no. 0.37.155.2014).

© 2016 Svetlana Iu. Sveshnikova

Introduction

At present, solutions for working with big data evolve very fast and are used in many subject fields. One of them is the processing of geodata. Many tasks need geodata; in this work the task is tsunami wave modeling. A tsunami is usually caused by an underwater earthquake. For modeling purposes, two main stages of wave propagation are distinguished. The first stage is wave propagation in the open sea. The propagation speed depends on the depth of the ocean and the acceleration of gravity: in the shallow-water approximation, c = √(g·h), where g is the acceleration of gravity and h is the depth. At this stage, computations may be executed on the main (coarse) grid. The second stage considers the behavior of the wave near the shore. The wavelength decreases as the wave approaches the coastline, because the depth becomes smaller; in this regard, it is necessary to reduce the step of the computational grid. Thus, re-interpolation of the computational grid is an important factor for modeling. High-quality modeling makes it possible to assess the scale of the disaster adequately and to take the right actions to protect people and strategically important objects. The speed of computing is also important for this task: the faster a tsunami is detected, the more time the services have for the adoption of protective measures.

Related works

The book [Poplavskii, Khramushin, …, 1997] discusses tsunami questions. The first section reviews the methods used to detect tsunami waves and to measure the level of disaster. The second section describes computing experiments that make it possible to predict and model tsunami waves. The book also contains information about two computer programs by the authors, “Ani” and “Mario”.

SciHadoop [Buck, Watkins, …, 2011] is a plugin for Hadoop that implements query execution logic over geodata using information about its structure. Several solutions make this possible: a file format for working with multidimensional arrays, which defines the key concepts of array shape and corner point; a custom query syntax created for this file format; and, finally, a corrected partitioning process in the Hadoop scheduler.

The paper [Biookaghazadeh, Xu, …, 2015] describes a software solution for native support of NetCDF files by Hadoop.

Problem statement

Tsunami wave modeling requires dynamic re-interpolation of the computational grid, which is obtained from NetCDF files. A program with this functionality already exists and is used in the Far East of Russia [Khramushin, 2010]. It is an old system with rich functionality, but its architectural decisions stop its evolution: the program does not work with files larger than 2 gigabytes and cannot run in distributed mode. Earlier it coped with the processing of the locally available data, but recently a new file with the results of altimetric surveys appeared [Olson, Becker, Sandwell, 2014]; it contains a full Earth grid with 15-arc-second resolution, and its size is 14 gigabytes. It will allow forecasting a tsunami from a more distant signal. The new program should meet the following requirements:

• processing of NetCDF files;
• work in distributed mode;
• processing of big files (14 GB and more).

This paper considers the organization of the interpolation functionality for this program and the problem of processing geodata with programs from the big data stack.

Main content

Fast processing of big data is impossible without distributed computing systems. Apache Spark, a very popular framework for distributed computing, was selected as this tool. The core of Apache Spark is a structure called the RDD (resilient distributed dataset). An RDD contains key-value pairs; transformations and actions can be executed on the whole dataset and run at the same moment on all cluster nodes.

Apache Spark is oriented towards unstructured streaming data, which makes it difficult to work with NetCDF files. Let us consider this in more detail. NetCDF is a scientific format used to describe geodata. A NetCDF file contains two blocks: metadata and data. For geodata, the metadata section defines terms such as dimensions, variables, global attributes and some others. The data block usually contains multidimensional arrays; for example, the depth data is contained in a 3-dimensional array: x-axis, y-axis, and depths. NetCDF uses random access to the file: information can be obtained from the middle of the file without reading the whole file from the beginning. With Apache Spark, on the contrary, a file can only be read entirely from beginning to end, and it is impossible to select the metadata or a single dimension of an array.

Fig. 1. Scheme of NetCDF file
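To make this contrast concrete, the sketch below shows NetCDF random access with the Unidata NetCDF-Java library: it reads one rectangular tile of the bathymetry grid by its corner point and shape, the same two concepts on which SciHadoop builds its array queries. The file name SRTM15_PLUS.nc, the variable name "z" and the tile indices here are illustrative assumptions; a real file should first be inspected, for example with ncdump.

```scala
import ucar.nc2.NetcdfFile

object NetcdfSubset {
  def main(args: Array[String]): Unit = {
    // Opening the file reads only its header; the 14 GB data block stays on disk.
    val nc = NetcdfFile.open("SRTM15_PLUS.nc") // assumed file name
    try {
      // Metadata is available without touching the data block at all.
      nc.getDimensions.forEach(d => println(s"${d.getFullName} = ${d.getLength}"))

      // Random access: read one rectangular tile by its corner point
      // (origin) and extent (shape); the rest of the file is never scanned.
      val z      = nc.findVariable("z")  // assumed variable name for depth
      val origin = Array(4000, 2000)     // corner point: (row, column)
      val shape  = Array(512, 512)       // tile extent
      val tile   = z.read(origin, shape)
      println(s"first depth value in the tile: ${tile.getDouble(0)}")
    } finally nc.close()
  }
}
```

Reading by origin and shape is exactly the operation that a plain Spark input format cannot express, since it streams a file linearly from the first byte.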
The problem of processing hierarchical files in batch and stream processing systems has been worked on for a long time, and some of the solutions were reviewed in the related works. In this work the new SciSpark library is used. SciSpark is a project supported by the Apache Foundation and a NASA laboratory; it works with NetCDF files using the linear algebra libraries Breeze and ND4J [Palamuttam et al., 2015].

The key concept of SciSpark is the sRDD structure, an analog of the RDD oriented towards the processing of scientific data, in particular NetCDF. An sRDD consists of several partitions, each of which contains a structure named sciTensor. The sciTensor is a self-documented array collection developed for sRDD transformations; its goal is to provide the logic that defines the data in the multidimensional format. Each sciTensor contains metadata in key-value form and an AbstractTensor that stores the data. AbstractTensor is a base class for BreezeTensor and ND4JTensor and is used to provide a common interface to the different linear algebra libraries. Breeze is the older of the two projects, but it supports only 2-dimensional arrays; ND4J supports linear algebra operations on multidimensional arrays.

Fig. 2. sRDD structure

The program workflow is as follows. The user defines a pair of coordinates, which forms the corner point of a rectangle, and the step of the new grid. The program cuts out the part of the data selected by these coordinates and runs the interpolation function on it; at the current time the Lagrange interpolation function is used. The result of the calculations may be saved to a NetCDF or text file, or re-used in the next operations.
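As an illustration, a sketch of such a Lagrange re-interpolation of a depth grid onto a finer step is given below. The function names, the local 4-point stencil, and the separable application of the 1-D rule (along x, then along y) are only assumptions of this sketch and may differ from the production code.

```scala
object LagrangeRegrid {

  // 1-D Lagrange interpolation at coordinate x over a local 4-point
  // stencil of a uniform grid with step h (stencil clamped at the edges).
  // Assumes the grid has at least 4 nodes in each direction.
  def lagrange1d(values: Array[Double], h: Double, x: Double): Double = {
    val i0 = math.min(math.max((x / h).toInt - 1, 0), values.length - 4)
    var sum = 0.0
    for (i <- i0 until i0 + 4) {
      var basis = 1.0
      for (j <- i0 until i0 + 4 if j != i)
        basis *= (x - j * h) / ((i - j) * h)
      sum += values(i) * basis
    }
    sum
  }

  // Re-interpolate a depth grid with step h onto a finer step hNew by
  // applying the 1-D rule separably: every row along x, then columns along y.
  def regrid(depth: Array[Array[Double]], h: Double, hNew: Double): Array[Array[Double]] = {
    val ny    = depth.length
    val nx    = depth(0).length
    val nxNew = ((nx - 1) * h / hNew).toInt + 1
    val nyNew = ((ny - 1) * h / hNew).toInt + 1
    // Pass 1: refine every coarse row along x.
    val rows = depth.map(row => Array.tabulate(nxNew)(i => lagrange1d(row, h, i * hNew)))
    // Pass 2: refine every fine column along y.
    Array.tabulate(nyNew, nxNew) { (j, i) =>
      lagrange1d(Array.tabulate(ny)(k => rows(k)(i)), h, j * hNew)
    }
  }
}
```

Each row in the first pass is processed independently of the others, which is what makes the operation easy to distribute: in a Spark version, the map over rows simply becomes a map over an RDD of rows spread across the cluster nodes.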
Conclusion and future work

The paper describes a geodata processing system based on a distributed architecture and big data stack technologies. The integration of NetCDF and Spark was performed using the SciSpark library and was tested on the task of grid re-interpolation. The study may serve as a prototype for a new tsunami forecast system.

References

Poplavskii A. A., Khramushin V. N., Nepop K. I. et al. Operativnyj prognoz navodnenij na morskikh beregakh Dalnego Vostoka Rossii [Operational forecast of flooding on the sea coasts of the Russian Far East]. Iuzhno-Sakhalinsk: FEB RAS, 1997 (in Russian).

Buck J. B., Watkins N., LeFevre J. et al. SciHadoop: Array-based query processing in Hadoop // Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 2011. P. 66.

Biookaghazadeh S., Xu Y., Zhou Sh. et al. Enabling scientific data storage and processing on big-data systems // 2015 IEEE International Conference on Big Data. IEEE, 2015. P. 1978–1984.

Khramushin V. N. Ani software and computing complex, 2010. State registration number 2010615848. URL: http://shipdesign.ru/SoftWare/2010615848.html (in Russian).

Olson C. J., Becker J. J., Sandwell D. T. A new global bathymetry map at 15 arcsecond resolution for resolving seafloor fabric: SRTM15_PLUS // AGU Fall Meeting Abstracts. 2014. Vol. 1. P. 03.

Palamuttam R. et al. SciSpark: Applying in-memory distributed computing to weather event detection and tracking // 2015 IEEE International Conference on Big Data. IEEE, 2015. P. 2020–2026.