Methodology for Evaluating the Effectiveness of the System of Dynamic Block Access to Data of Ultra-Large Distributed Remote Sensing Archives

Andrey Proshin^a, Evgeniy Loupian^a and Sergey Bartalev^a

^a Space Research Institute of the Russian Academy of Sciences (IKI), Moscow, Russia

Abstract
The article analyzes the efficiency of the system of dynamic block access to data of ultra-large distributed Earth remote sensing (ERS) archives, implemented using the technology developed at the Space Research Institute. It discusses why this way of providing access to heterogeneous satellite data for joint processing is relevant, and gives a brief overview of how the dynamic block access system is implemented. The system prepares the required set of data blocks with given characteristics for processing on a server cluster. The main factors influencing the efficiency of data block preparation are analyzed. The article then describes the methodology developed by the authors for estimating the time needed to prepare a given dataset on a specific hardware infrastructure.

Keywords
Heterogeneous satellite data, satellite data archives, satellite data processing, big data

1. Introduction
Providing data as blocks of fixed spatial partitions is widely used in a variety of visualization and territory-based data processing tasks. This approach is especially relevant for processing long time series of heterogeneous Earth remote sensing satellite data over large areas in order to obtain regular information products that make it possible to analyze the dynamics of changes in certain surface characteristics. Such products are currently used to solve a wide range of scientific and applied problems related to monitoring the natural environment and anthropogenic factors.
In the framework of the described approach, satellite data for the area of interest are processed in parallel on a set of servers, each of which is given a subset of the tiles into which the area is divided. The main advantage of this mechanism is that it provides the required degree of parallelization, which allows the most efficient use of the available computing resources. The approach also makes it possible to cache data tiles effectively when they can be reused, which can significantly reduce the load on both the computing resources and the network connections involved. It should be noted that the block approach is currently used by almost all spatial data visualization systems that provide mass access to such information.

At present, the most common implementation of the block approach relies on specially organized satellite data archives that store various types of data in a single spatial partition and projection. However, this option works well only when it is necessary to operate with a relatively static and homogeneous set of data, for which a single scheme for organizing storage and presentation can be chosen. For processing and analyzing data over large areas with joint use of information from different observation systems, such an approach is in many cases difficult to implement and impractical. This is primarily due to the fact that the optimal storage schemes for data with different resolutions and observation schemes can differ significantly from each other. The data can be tiled in some regular way for a given projection or split into regular scenes, e.g. by dividing a sensing orbit into equal-sized overlapping fragments. As an illustration, Figure 1 shows the contours of various types of satellite data. Terra/Aqua MODIS data, stored as regular granules in the sinusoidal projection, are shown in red. Data from the Russian KMSS instrument installed onboard the Meteor-M series satellites are shown in green. Landsat-8 data stored in the UTM projection are shown in orange. This heterogeneity considerably complicates the development of new data processing routines, and joint (fusion) processing and analysis becomes even more complicated.

VI International Conference Information Technologies and High-Performance Computing (ITHPC-2021), September 14–16, 2021, Khabarovsk, Russia
EMAIL: andry@iki.rssi.ru
ORCID: 0000-0003-1470-647X (A. 1); 0000-0001-5943-0695 (A. 2); 0000-0002-4198-4400 (A. 3)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

It is important to note that data from a large number of different ERS satellite systems are currently used to solve real problems [1-3], and their diversity only increases from year to year. As a result, it is almost impossible to develop a regular partition that would be equally effective for different types of data and different processing and analysis tasks. At the same time, maintaining several archives that contain the same data in different representations leads to very significant overhead costs for data storage and preparation.

Figure 1: Contours of different satellite data storage schemes.

An alternative to the described implementation is the dynamic formation of data blocks in a spatial division and with characteristics (spatial resolution, projection, set of channels, etc.) that are optimal for a specific data processing or visualization task.
This makes it possible to use the main advantages of the block approach for specific data processing problems while relying on existing ultra-large satellite data archives. In this case, dynamic block access can be provided both to data stored in certain regular partitions and to data whose storage is not tied to such partitions.

To implement this option, a technology for dynamic block access to the satellite data archives of the IKI-Monitoring Shared Use Center was developed [4-8]. The distributed archives of this center are built on the unified satellite data archive management platform for remote monitoring systems development, implemented at the Space Research Institute [9].

Processing large arrays of remote sensing satellite data naturally requires significant computing resources and places high demands on the network connections used. Therefore, optimizing the procedures for the dynamic provision of data by blocks is one of the key tasks in implementing this technology. This article is devoted to a comprehensive analysis of the main factors affecting the performance of a dynamic block access system for distributed satellite data archives. These factors include the performance of processing servers and data storage systems, the speed of network connections, the size and type of the satellite data provided, etc. The article presents a technique for quickly assessing the resources required to provide a given set of data blocks with the desired characteristics. At the stage of implementing a new type of processing, it makes it possible to estimate the time spent on providing data blocks, estimate the resources required for this task, and work out an optimal solution based on the real performance of the existing infrastructure.

2. Methodology for estimating the time spent on preparing data blocks
Within the developed technology of dynamic block access to satellite data archives [7,8], data blocks are prepared in parallel on a cluster of specialized servers that have direct access to the data files in the archive. Such servers must be installed in each information node of the distributed archive, where they serve HTTP requests for data from the corresponding local volumes of the archive. The requests themselves are formed by the dispatch server, which acts as a smart load balancer. The data block generation procedure is based on the free GDAL software. During its execution, only the necessary fragments of the original data files are read, and a GeoTIFF file is produced with the specified characteristics: size, projection, resolution, set of channels and compression algorithm. The file is then sent over the network to a server that acts as a disk buffer; it caches the prepared data blocks and serves as the source of input data for one or more processing runs. Efficient multithreaded processing of satellite data on large clusters of specialized servers relies on technologies also developed at the Space Research Institute [10,11].

To estimate the total preparation time of the data required for a specific type of processing, it is necessary to obtain the average execution time of a request for one data block of each type. Based on these values, the total time needed to prepare the required dataset on the available infrastructure can be estimated.
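The aggregation described in the last two sentences can be sketched as follows. This is a minimal illustration and not the authors' implementation: the `BlockType` structure, the function name, the block type names and all numbers are hypothetical, and the division by the number of parallel requests assumes independent requests spread evenly over the workers.

```python
from dataclasses import dataclass


@dataclass
class BlockType:
    """One kind of data block needed by a processing task (hypothetical structure)."""
    name: str
    avg_request_seconds: float  # measured average time to prepare one block
    block_count: int            # number of blocks of this type in the dataset


def total_preparation_seconds(block_types, parallel_requests=1):
    """Estimate the wall-clock time needed to prepare all blocks.

    Assumption: requests are independent and evenly distributed over
    `parallel_requests` concurrent workers (an idealisation).
    """
    if parallel_requests < 1:
        raise ValueError("parallel_requests must be at least 1")
    # Time a single thread would need: sum of (average time x count) per type.
    single_thread = sum(bt.avg_request_seconds * bt.block_count for bt in block_types)
    return single_thread / parallel_requests


# Illustrative values only; they are not measurements from the article.
dataset = [
    BlockType("landsat8_utm", avg_request_seconds=2.0, block_count=1000),
    BlockType("modis_sinusoidal", avg_request_seconds=0.5, block_count=4000),
]
print(total_preparation_seconds(dataset))      # one computational thread
print(total_preparation_seconds(dataset, 8))   # eight parallel requests
```

In practice the scaling with the number of workers is unlikely to be perfectly linear, since storage systems and network links are shared, so the result is best read as a lower bound on the preparation time.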
The execution time of a request to prepare a data block depends on many factors, including the characteristics of the prepared files listed above, the type of source data, the scheme used to organize their storage, the speed of access to specific file servers, and the performance of network links. Establishing all of these dependencies without a simplified model takes a great deal of time and resources, which makes it impossible to quickly estimate the time required for a new type of processing. To build such a model, the influence of the most significant factors was analyzed, and a number of simplified dependencies were established that can be used to estimate the preparation time for different samples of source data. In particular, the time of data transmission over the network and the time required for compression do not depend significantly on the type of source data or on how their storage is organized. In addition, the model expresses the preparation time of data blocks as a function of their size in image pixels rather than their size in geographic coordinates. It is important to note that the simplified model described below is approximate, and the estimates for a chosen use case can be refined if necessary. The time needed to prepare data for one channel of a satellite instrument was chosen as the main quantity to be modeled.
Below is a formula for estimating this time as a sum of dependencies that can be established experimentally:

T-channel(n, store-type, result-projection, compression-type) =
    T-base(n, store-type, result-projection)
    + T-compression(n, compression-type)
    + T-get(n, compression-type)

where:
• n is the linear size of the resulting data block in pixels, assuming that the blocks are square;
• store-type is a set of characteristics of the source files, the key ones being their projection and the characteristic size of fragments;
• result-projection is the projection of the resulting data;
• compression-type is the compression algorithm used.

The first term, T-base, describes the base (main) time needed to form the data block, depending on its size, the projection and fragmentation of the source data, and the projection in which the results are to be obtained. It is essential that a dependence obtained for a specific type of data can be applied to other types of satellite data with a similar storage organization. The second (optional) term, T-compression, corresponds to the time required to compress the resulting data block and depends only on the block size and the compression algorithm used. The third term, T-get, is the time it takes to transfer the resulting file over the network and depends only on the size of the data. The following subsections briefly present the main methods for evaluating each of these dependencies, with examples of the dependencies obtained.

2.1. Experimental derivation of the T-base dependence
To obtain the base dependence of the data block preparation time on the block size in pixels, this time must be measured for various block sizes. It is important to note that the preparation time for a data block depends significantly on the speed of access to the particular file storage system that holds the source data needed to form it.
The difference in the performance of different storage systems can be due to the specifics of their hardware implementation, their workload and the bandwidth of network communication channels. The time needed to form data blocks may also depend on a number of other factors, such as the geographical coordinates of the requested area and how fully it is covered by data. Therefore, for each block size it is necessary to obtain the average time over all data blocks that must be provided for processing. Within the presented methodology, this time is estimated as the average over a set of randomly selected data blocks from the desired sample. For each block size the sample is generated anew, which prevents cached reads of the original data files from significantly distorting the results.

Figure 2 shows examples of base dependencies obtained when forming blocks from data of the OLI_TIRS instrument installed on the Landsat series satellites, for the period from 2013 to 2021 and a range of coordinates approximately corresponding to the territory of Russia. Dependencies are given for the case in which the UTM projection is preserved and for the case in which the result is produced in the geographic projection. The studies have shown that, when sufficient statistics are collected, the obtained dependencies are well approximated by second-degree polynomials. This makes it possible to accurately estimate the preparation time for any block size in pixels from a small number of measurements at different sizes. The fact that the preparation time does not tend to zero as the block size decreases is explained by the overhead incurred on each launch of the data block generation procedure.
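The second-degree polynomial approximation mentioned above can be sketched as an ordinary least-squares fit. The code below is an illustration under stated assumptions rather than the procedure used in the study: the function names are hypothetical, the (size, time) pairs are invented for the example, and the fit is written out in plain Python so that it has no external dependencies.

```python
def fit_quadratic(sizes, times):
    """Least-squares fit of times ~ a*n**2 + b*n + c; returns (a, b, c)."""
    if len(sizes) != len(times) or len(sizes) < 3:
        raise ValueError("need at least three (size, time) pairs")
    # Scale n into (0, 1] to keep the normal equations well conditioned.
    scale = max(sizes)
    xs = [n / scale for n in sizes]
    # Power sums for the 3x3 normal equations.
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(xs, times)) for k in range(3)]
    # Augmented matrix; unknowns are ordered (c, b, a) in scaled units.
    m = [
        [s[0], s[1], s[2], t[0]],
        [s[1], s[2], s[3], t[1]],
        [s[2], s[3], s[4], t[2]],
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        if abs(m[col][col]) < 1e-12:
            raise ValueError("sizes must contain at least three distinct values")
        for row in range(3):
            if row != col:
                factor = m[row][col] / m[col][col]
                m[row] = [v - factor * p for v, p in zip(m[row], m[col])]
    c, b, a = (m[i][3] / m[i][i] for i in range(3))
    # Undo the scaling of n.
    return a / scale ** 2, b / scale, c


def predict_seconds(coeffs, n):
    """Evaluate the fitted polynomial for a block of n x n pixels."""
    a, b, c = coeffs
    return a * n * n + b * n + c


# Hypothetical average preparation times (seconds) per block size (pixels).
sizes = [256, 512, 1024, 2048, 4096]
times = [0.7, 1.3, 3.4, 10.8, 38.0]
coeffs = fit_quadratic(sizes, times)
print(predict_seconds(coeffs, 3000))
```

The constant term `c` of the fit corresponds to the per-launch overhead noted above: it is the part of the preparation time that remains as the block size tends to zero.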
It is important to note that a significant drop in the efficiency of disk read and write operations as the amount of data they handle decreases makes a significant contribution to almost all of the investigated dependencies. To evaluate the efficiency of data generation for different block sizes, one can obtain the dependence of the preparation time for a fixed data area on the block size in pixels. An example of such a dependence is shown in Figure 3. It shows that the specific time decreases as the block size increases, but for large block sizes this effect practically disappears.

2.2. Experimental derivation of the T-compression dependence
The dependence of the compression time on the block size can be obtained by subtracting the dependence measured for preparation without compression from the dependence of the preparation time for blocks that use a given compression algorithm. This approach is illustrated in Figure 4. It shows curves for the preparation time of data without compression and with different compression algorithms supported by the GDAL software, together with curves for the desired T-compression dependence. For illustration, the fastest algorithm, LZW, and the slower but more efficient DEFLATE algorithm were selected. The required dependencies are shown in the graph with dashed lines. Note that the resulting dependencies of compression time on block size are also well approximated by second-degree polynomials.

Figure 2: Examples of derived base dependencies
Figure 3: Examples of dependencies of fixed data area (million pixels) preparation time on block size
Figure 4: Examples of obtaining the T-compression dependencies

The degree of data compression achieved by each of the above algorithms was also evaluated.
When blocks that are completely filled with data are compressed, the compression percentage for the DEFLATE algorithm is stable at around 23%, while for the LZW algorithm the file size even increases by about 3%. However, using the LZW algorithm may still be advisable when a significant part of the data blocks is not completely filled with data and earlier versions of the GDAL software are used, in which the storage of such data is implemented inefficiently.

2.3. Experimental derivation of the T-get dependence
Since the download time of data files over the standard HTTP protocol is practically independent of their content, one can first establish the dependence of the download time on the file size in megabytes that is typical for the existing infrastructure. The resulting dependence is very well approximated by a linear function, from which the dependence of the data transfer rate on the file size can be plotted. An example of this dependence is shown in Figure 5. As the file size increases, the data transfer rate tends to the maximum bandwidth of the communication channel. Figure 6 then shows examples of the required dependencies of the download time of data blocks on their size in pixels for various compression algorithms; these are approximated by second-degree polynomials whose coefficients are calculated from the established linear dependence.

Figure 5: An example of the dependence of the file transfer rate on its size
Figure 6: Examples of the obtained dependencies of the transmission time of a data block on its size and compression type

3. Results
Based on a detailed analysis of the main factors affecting the performance of the system of dynamic block access to remote sensing archive data, a methodology was developed that makes it possible to quickly estimate the time cost of preparing the required set of different types of satellite data on a given set of hardware.
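As an illustration of how the T-get term from Section 2.3 fits into such an estimate, the sketch below converts a linear dependence of download time on file size into a dependence on block size in pixels. It is a simplified model under explicit assumptions: the parameter names are hypothetical, the block is taken to hold a single channel with a fixed number of bytes per pixel, and the compressed size is taken to be a constant fraction of the raw size. All numeric values are illustrative and are not measurements from the article.

```python
def t_get_seconds(n, latency_s, seconds_per_mb, bytes_per_pixel=2,
                  compressed_fraction=1.0):
    """Transfer time for one single-channel block of n x n pixels.

    latency_s and seconds_per_mb are the intercept and slope of the measured
    linear dependence of download time on file size (hypothetical names).
    compressed_fraction is compressed size divided by raw size (assumed constant).
    """
    size_mb = n * n * bytes_per_pixel * compressed_fraction / 1e6
    # Linear in file size, hence a second-degree polynomial in n.
    return latency_s + seconds_per_mb * size_mb


def transfer_rate_mb_s(size_mb, latency_s, seconds_per_mb):
    """Effective rate implied by the linear model.

    For large files it approaches 1 / seconds_per_mb, the channel bandwidth.
    """
    return size_mb / (latency_s + seconds_per_mb * size_mb)


# Illustrative parameters: 50 ms fixed cost and 0.01 s per megabyte.
for n in (256, 1024, 4096):
    print(n, t_get_seconds(n, latency_s=0.05, seconds_per_mb=0.01))
for size_mb in (0.1, 1.0, 100.0):
    print(size_mb, transfer_rate_mb_s(size_mb, latency_s=0.05, seconds_per_mb=0.01))
```

With a constant compressed fraction, the only n-dependent part of the model is proportional to n squared, which is consistent with the second-degree polynomials reported for Figure 6; a compression ratio that varies with block content would require a measured size for each block instead.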
The key elements of the presented methodology are algorithms for the experimental derivation of a number of basic dependencies. The estimates obtained for one computational thread make it possible to approximate the data preparation time for a given number of computational nodes. As mentioned above, mass processing of satellite data requires very significant computing resources and time. Therefore, even at the planning stage of a new type of processing, it is necessary to estimate the computational resources needed to complete it within a given time frame. The presented method makes it possible to estimate the resources required to prepare a given set of data at the required speed. The resulting estimates can also be used to select the optimal size and characteristics of the data blocks.

4. Acknowledgements
The methodology research and development were performed within the project “Big data in space research: astrophysics, Solar system, geosphere” (state reg. No. 0024-2019-0014). The evaluation and testing stages of the studies were carried out using the resources of the Center for Shared Use of Scientific Equipment "Center for Processing and Storage of Scientific Data of the Far Eastern Branch of the Russian Academy of Sciences", funded by the Russian Federation represented by the Ministry of Science and Higher Education of the Russian Federation under project No. 075-15-2021-663.

5. References
[1] Loupian E.A., Bourtsev M.A., Proshin A.A., Kobets D.A. Evolution of remote monitoring information systems development concepts // Actual Problems of Remote Sensing of the Earth from Space. 2018. Vol. 15. No. 3. P. 53-66. DOI: 10.21046/2070-7401-2018-15-3-53-66.
[2] Euroconsult. Brochure «Satellites to be built & launched by 2026». 2017. URL: http://www.euroconsult-ec.com/research/satellites-built-launched-by-2026-brochure.pdf.
[3] Zhu L. et al. A Review: Remote Sensing Sensors. IntechOpen, 2018.
[4] Proshin A.A., Bourtsev M.A., Balashov I.V., Loupian E.A., Radchenko M.V., Sychugov I.G. “IKI-Monitoring” shared use center support and development - possible solutions // Sovremennye problemy distantsionnogo zondirovaniya Zemli iz kosmosa. 2020. Vol. 17. No. 6. P. 51-55. DOI: 10.21046/2070-7401-2020-17-6-51-55.
[5] Loupian E.A., Proshin A.A., Bourtsev M.A., Kashnitskiy A.V., Balashov I.V., Bartalev S.A., Konstantinova A.M., Kobets D.A., Mazurov A.A., Marchenkov V.V., Matveev A.M., Radchenko M.V., Sychugov I.G., Tolpin V.A., Uvarov I.A. Experience of development and operation of the IKI-Monitoring center for collective use of systems for archiving, processing and analyzing satellite data // Actual Problems of Remote Sensing of the Earth from Space. 2019. Vol. 16. No. 3. P. 151-170. DOI: 10.21046/2070-7401-2019-16-3-151-170.
[6] Loupian E.A., Proshin A.A., Bourtsev M.A., Balashov I.V., Bartalev S.A., Efremov V.Yu., Kashnitskiy A.V., Mazurov A.A., Matveev A.M., Sudneva O.A., Sychugov I.G., Tolpin V.A., Uvarov I.A. IKI center for collective use of satellite data archiving, processing and analysis systems aimed at solving the problems of environmental study and monitoring // Actual Problems of Remote Sensing of the Earth from Space. 2015. Vol. 12. No. 5. P. 263-284.
[7] Proshin A.A., Matveev A.M., Kashnitskiy A.V., Bourtsev M.A. Satellite data efficient processing with dynamic block archive access // Sovremennye problemy distantsionnogo zondirovaniya Zemli iz kosmosa. 2020. Vol. 17. No. 6. P. 56-60. DOI: 10.21046/2070-7401-2020-17-6-56-60.
[8] Proshin A.A., Loupian E.A., Balashov I.V., Kashnitskiy A.V., Matveev A.M., Rutkevich B.P. Technology of satellite data dynamic block provision to distributed processing systems // Actual Problems of Remote Sensing of the Earth from Space. 2020. Vol. 17. No. 7. P. 79-93. DOI: 10.21046/2070-7401-2020-17-7-79-93.
[9] Proshin A.A., Loupian E.A., Balashov I.V., Kashnitskiy A.V., Bourtsev M.A. Unified satellite data archive management platform for remote monitoring systems development // Actual Problems of Remote Sensing of the Earth from Space. 2016. Vol. 13. No. 3. P. 9-27. DOI: 10.21046/2070-7401-2016-13-3-9-27.
[10] Kobets D.A., Matveev A.M., Proshin A.A., Mazurov A.A. Operation control and management of distributed complexes of automatic streaming processing of satellite data // Materials of the fifth international scientific and technical conference "Actual problems of creation of space remote sensing systems of the Earth". Electromechanical matters. VNIIEM studies, 2018. P. 225-234.
[11] Kobets D.A., Matveev A.M., Mazurov A.A., Proshin A.A. Organization of automated multithreaded processing of satellite information in remote monitoring systems // Actual Problems of Remote Sensing of the Earth from Space. 2015. Vol. 12. No. 1. P. 145-155.