Comprehensive analysis of the efficiency of the system of dynamic block access to data of ultra-large distributed remote sensing archives

Andrey A. Proshin1, Evgeniy A. Loupian1

1 Space Research Institute of the Russian Academy of Sciences, Moscow, Russia

Abstract

The article analyzes the efficiency of a system of dynamic block access to data in ultra-large distributed Earth remote sensing (ERS) archives, implemented with a technology developed at the Space Research Institute. The system prepares the required set of data blocks with given characteristics for processing on a server cluster. Based on an analysis of the most significant factors affecting the performance of this system, a methodology was developed that makes it possible to estimate the preparation time of the required data set on the existing hardware infrastructure.

Keywords

Satellite data archives, satellite data processing, big data.

1. Introduction

Many tasks related to the processing and visualization of spatial data can be solved most effectively with the block approach, which operates on tiles into which the entire area of interest is evenly divided. This approach is especially relevant for processing long time series of heterogeneous Earth remote sensing satellite data over large areas in order to obtain regular information products that make it possible to analyze the dynamics of changes in certain surface characteristics. Such products are currently widely used to solve a broad range of scientific and applied problems related to the monitoring of the natural environment and anthropogenic factors. Within this approach, satellite data for the area of interest are processed in parallel on a set of servers, each of which is provided with a subset of the tiles into which the area is divided.
The main advantage of this mechanism is the ability to provide the required degree of parallelization of processing, which allows the most efficient use of available computing resources. This approach also makes it possible to cache data tiles effectively when they can be reused, which can significantly reduce the load on both computing resources and network connections. It should be noted that the block approach is currently used by almost all spatial data visualization systems that provide mass access to such information.

SDM-2021: All-Russian conference, August 24–27, 2021, Novosibirsk, Russia. Contact: andry@iki.rssi.ru (A. A. Proshin). © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073, pp. 222–230.

The most common implementation of the block approach at present is the use of specially organized archives of satellite data, which store various types of data in a single spatial partition and projection. However, this option works well only when it is necessary to operate with a relatively static and homogeneous set of data, for which a single scheme for organizing the storage and presentation of data can be chosen. For the processing and analysis of data over large areas with the joint use of information from different observation systems, such an approach is in many cases difficult to implement and impractical. This is primarily because the optimal storage schemes for data with different resolutions and observation schemes can differ significantly from each other.
It is important to note that data from a large number of different ERS satellite systems are currently used to solve real problems [1, 2, 3], and their diversity only increases from year to year. As a result, it is almost impossible to develop a regular partition that would be equally effective for different types of data and different processing and analysis tasks. At the same time, simultaneously maintaining several archives that contain the same data in different representations leads to very significant overhead costs for data storage and preparation.

An alternative to the described implementation is the dynamic formation of data blocks in a spatial division and with characteristics (spatial resolution, projection, set of channels, etc.) that are optimal for a specific data processing or visualization task. This makes it possible to use the main advantages of the block approach for specific data processing problems on top of existing ultra-large satellite data archives. In this case, dynamic block access can be provided both to data stored in certain regular partitions and to data whose storage is not tied to such partitions.

To implement this alternative, a technology for dynamic block access to the satellite data archives of the IKI-Monitoring Shared Use Center was developed [4, 5, 6, 7, 8]. The distributed archives of this center are built on the unified satellite data archive management platform for remote monitoring systems development, implemented at the Space Research Institute [9].

The processing of large arrays of remote sensing satellite data naturally requires significant computing resources and also places high demands on the network connections used.
Therefore, optimizing the procedures for the dynamic provision of data by blocks is one of the key tasks in implementing this technology.

This article is devoted to a comprehensive analysis of the main factors affecting the performance of a system of dynamic block access to distributed satellite data archives. These factors include the performance of processing servers and data storage systems, the speed of network connections, the size and type of the satellite data provided, etc. The article presents a technique for quickly assessing the resources required to provide a given set of data blocks with the desired characteristics. At the stage of implementing a new type of processing, it makes it possible to estimate the time spent on providing data blocks, to estimate the resources required for this task, and to work out an optimal solution based on the real performance of the existing infrastructure.

2. Methodology for estimating the time spent on preparing data blocks

Within the developed technology of dynamic block access to data in satellite data archives [7, 8], data blocks are prepared in parallel on a cluster of specialized servers that have direct access to the data files in the archive. Such servers must be installed in each information node of the distributed archive, where they serve HTTP requests for data from the corresponding local volumes of the archive. The requests themselves are formed by the dispatch server, which acts as a smart load balancer. The data block generation procedure is implemented on the basis of the free software GDAL. During its execution, only the necessary fragments of the original data files are read, and a file in the GeoTIFF format is formed with the specified characteristics: size, projection, resolution, set of channels and compression algorithm.
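As an illustration of the block generation step, the following minimal sketch builds a command line for the `gdalwarp` utility that produces one square GeoTIFF block. The paper states only that the procedure is based on GDAL; the choice of `gdalwarp`, the function name `build_block_command` and all argument values are assumptions made for this example, and channel selection is omitted.

```python
from typing import List, Tuple


def build_block_command(
    src_path: str,
    dst_path: str,
    target_srs: str,
    extent: Tuple[float, float, float, float],
    size_px: int,
    compression: str = "NONE",
) -> List[str]:
    """Build a gdalwarp argument list for one square GeoTIFF block.

    Hypothetical helper: the real system's invocation is not described
    in the paper. The extent is (xmin, ymin, xmax, ymax) in the target
    projection.
    """
    xmin, ymin, xmax, ymax = extent
    return [
        "gdalwarp",
        "-of", "GTiff",                       # output format
        "-t_srs", target_srs,                 # projection of the result
        "-te", str(xmin), str(ymin), str(xmax), str(ymax),  # block extent
        "-ts", str(size_px), str(size_px),    # block size in pixels
        "-co", f"COMPRESS={compression}",     # GeoTIFF compression algorithm
        src_path,
        dst_path,
    ]


if __name__ == "__main__":
    # Illustrative values only.
    print(" ".join(build_block_command(
        "scene.tif", "block.tif", "EPSG:4326",
        (30.0, 50.0, 31.0, 51.0), 512, "DEFLATE")))
```

The resulting list can be passed to `subprocess.run` on a host where GDAL is installed; building the arguments as a list avoids shell quoting issues.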
After that, the file is sent over the network to a server that acts as a disk buffer: it caches the prepared data blocks and serves as the source of initial information for one or more processing runs. Efficient multithreaded processing of satellite data on large clusters of specialized servers relies on technologies also developed at the Space Research Institute [10, 11].

To estimate the total preparation time of the data required for a specific type of processing, it is necessary to obtain the average execution time of a request for one data block of each type. Based on these characteristics, the total time needed to prepare the required dataset on the available infrastructure can be estimated. The execution time of a request to prepare a data block depends on many factors, including the above characteristics of the files being prepared, as well as the type of initial data, their storage organization scheme, the speed of access to specific file servers, and the performance of network links. Establishing all of these dependencies without building a simplified model takes a lot of time and resources, which makes it impossible to estimate quickly the time required for a new type of processing. To build such a model, the influence of the most significant factors was analyzed, and a number of simplified dependencies were established that can be used to estimate the preparation time for different samples of initial data. In particular, the time of data transmission over the network and the time required for compression do not depend significantly on the type of source data or on the organization of their storage. Also, the developed model uses the dependence of the block preparation time on the block size in image pixels rather than on its size in geographic coordinates.
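The total-time estimate described above can be sketched as follows. The paper does not give an explicit formula for the total, so this is only one plausible reading: the sum of block counts times average request times, divided by the number of requests served in parallel. The ideal parallel scaling, the `BlockType` structure and all numbers are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class BlockType:
    name: str              # hypothetical label for a kind of data block
    count: int             # number of blocks of this type to prepare
    mean_request_s: float  # measured average time of one request, seconds


def estimate_total_time(block_types: Iterable[BlockType],
                        parallel_requests: int = 1) -> float:
    """Estimate wall-clock preparation time in seconds.

    Assumes ideal scaling: requests are spread evenly over
    `parallel_requests` workers and do not slow each other down.
    """
    if parallel_requests < 1:
        raise ValueError("parallel_requests must be at least 1")
    total_work = sum(bt.count * bt.mean_request_s for bt in block_types)
    return total_work / parallel_requests


if __name__ == "__main__":
    # Illustrative numbers, not measurements from the paper.
    plan = [BlockType("utm_1024", 1000, 2.0), BlockType("geo_2048", 500, 4.0)]
    print(estimate_total_time(plan, parallel_requests=8))
```

In practice the storage systems and network links are shared, so the real speed-up from parallel requests is expected to be lower than this ideal estimate.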
It is important to note that the simplified model described below is approximate, and the estimates for the option chosen for use can be refined if necessary. The preparation time of data for one channel of the satellite instrument was chosen as the main quantity to be modeled. This time is estimated as the sum of several dependencies that can be established experimentally:

T-channel(n, store-type, result-projection, compression-type) =
    T-base(n, store-type, result-projection)
    + T-compression(n, compression-type)
    + T-get(n, compression-type),

where:
n is the linear size of the resulting data block in pixels, assuming that the blocks are square;
store-type is a set of characteristics of the source files, the key ones being their projection and the characteristic size of fragments;
result-projection is the projection of the resulting data;
compression-type is the compression algorithm used.

The first term, T-base, describes the base (main) time of formation of the data block, which depends on the block size, on the projection and fragmentation of the source data, and on the projection in which the results are to be obtained. It is essential that the dependence obtained for a specific type of data can be applied to other types of satellite data with a similar storage organization. The second (optional) term, T-compression, corresponds to the time required to compress the resulting data block and depends only on the block size and the compression algorithm used. The third term, T-get, is the time it takes to transfer the resulting file over the network; it depends only on the size of the file, which is determined by the block size and the compression algorithm.

The following subsections briefly present the main methods for evaluating each of these dependencies, with examples of the dependencies obtained.

2.1. Experimental derivation of the T-base dependence

To obtain the basic dependence of the block preparation time on the block size in pixels, this time must be measured for various values of the size. The preparation time of a data block depends significantly on the speed of access to the particular file storage system that holds the initial data needed to form it. Differences in the performance of storage systems can be caused by their hardware implementation, their workload and the bandwidth of network communication channels. The formation time of data blocks may also depend on a number of other factors, such as the geographical coordinates of the requested area and the degree to which it is filled with data. Therefore, for each block size it is necessary to obtain the time averaged over all data blocks that must be provided for processing. In the presented methodology, this time is estimated as the average over a set of randomly selected data blocks from the desired sample. For each block size, the sample is generated anew, which avoids caching of reads of the original data files that could significantly distort the results.

Figure 1 shows examples of basic dependencies obtained when forming blocks from the data of the OLI_TIRS instrument installed on the Landsat series satellites, for the period from 2013 to 2021 and a range of coordinates approximately corresponding to the territory of Russia. Dependencies are given for the case in which the UTM projection is preserved and for the case in which the result is obtained in the geographic projection.
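The measurement procedure described above can be sketched as follows: for each block size a fresh random sample of blocks is drawn and the request times are averaged. The function `request_block` stands for a hypothetical call that requests one block of the given size and returns the measured time in seconds; its name, the block identifiers and the sample size are assumptions of this example, not part of the described system.

```python
import random
import statistics
from typing import Callable, Dict, Sequence


def mean_prep_times(
    candidate_blocks: Sequence[str],
    sizes_px: Sequence[int],
    request_block: Callable[[str, int], float],
    sample_size: int,
    rng: random.Random,
) -> Dict[int, float]:
    """Return the average preparation time (seconds) for each block size.

    A new random sample is drawn for every size, so that measurements
    for one size are not taken on the same blocks as the previous one.
    """
    means: Dict[int, float] = {}
    for n in sizes_px:
        sample = rng.sample(list(candidate_blocks), sample_size)
        means[n] = statistics.fmean(request_block(b, n) for b in sample)
    return means


if __name__ == "__main__":
    # Stand-in for a real timed request; purely synthetic timings.
    def fake_request(block_id: str, n: int) -> float:
        return 0.5 + 2e-7 * n * n

    blocks = [f"block_{i}" for i in range(100)]
    print(mean_prep_times(blocks, [256, 512, 1024], fake_request,
                          sample_size=10, rng=random.Random(0)))
```

Passing an explicit `random.Random` instance keeps a measurement campaign reproducible while still giving a different sample for each block size.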
The studies have shown that, when sufficient statistics are collected, the obtained dependences are well approximated by second-degree polynomials. This makes it possible to estimate the preparation time for any block size in pixels from a small number of measurements at different sizes. The preparation time does not tend to zero as the block size decreases because each launch of the data block generation procedure carries overhead costs. Note also that the efficiency of disk read and write operations drops significantly as the amount of data they handle decreases, and this makes a significant contribution to almost all of the investigated dependencies.

Figure 1: Examples of derived base dependencies.

Figure 2: Examples of dependencies of fixed data area (million pixels) preparation time on block size.

To evaluate the efficiency of the data generation procedures for different block sizes, the dependence of the preparation time of a fixed data area on the block size in pixels can be obtained. An example of such a dependence is shown in Figure 2. It shows that the specific time decreases as the block size increases, but for large block sizes this effect practically disappears.

2.2. Experimental derivation of the T-compression dependence

The dependence of the compression time on the block size can be obtained by subtracting the dependence measured for the preparation procedure without compression from the dependence of the preparation time measured with a given compression algorithm. This approach is illustrated in Figure 3. It shows curves corresponding to the preparation time of data without compression and with different compression algorithms supported by GDAL, together with curves corresponding to the desired T-compression dependence. For illustration, the fastest algorithm, LZW, and the slower but more efficient DEFLATE algorithm were selected. The required dependencies are shown in the graph with dashed lines. Note that the resulting dependences of the block compression time on block size are also well approximated by second-degree polynomials.

Figure 3: Examples of getting dependencies T-compression.

The degree of data compression achieved by each of these algorithms was also evaluated. When compressing blocks that are completely filled with data, the compression percentage for the DEFLATE algorithm is stable at around 23%, while for the LZW algorithm the file size even increases by about 3%. However, the LZW algorithm may still be advisable when a significant part of the data blocks is not completely filled with data and earlier versions of GDAL are used, in which the storage of such data is implemented inefficiently.

2.3. Experimental derivation of the T-get dependence

Since the download time of data files over the standard HTTP protocol practically does not depend on their content, it is possible first to establish the dependence of the file download time on the file size in megabytes that is typical for the existing infrastructure. An example of such a dependence, well approximated by a linear function, is shown in Figure 4. Figure 5 shows the corresponding dependence of the file transfer rate on the file size. This rate tends to the maximum for the communication channel used as the file size grows. Figure 6 shows examples of the required dependences of the download time of data blocks on their size in pixels for various compression algorithms; these are approximated by second-degree polynomials with coefficients calculated from the established linear dependence.
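The way the three dependences combine can be sketched as follows, assuming that the second-degree polynomials for block preparation with and without compression have already been fitted. Polynomials are stored as coefficient tuples (c0, c1, c2) for c0 + c1·n + c2·n². The coefficient values, the number of bytes per pixel, the compressed-size factor and the use of 10^6 bytes per megabyte are all illustrative assumptions, not measurements from the paper.

```python
from typing import Tuple

Poly2 = Tuple[float, float, float]  # (c0, c1, c2): c0 + c1*n + c2*n**2


def poly_eval(c: Poly2, n: float) -> float:
    """Evaluate a second-degree polynomial at block size n (pixels)."""
    return c[0] + c[1] * n + c[2] * n * n


def t_compression_poly(with_compression: Poly2, without: Poly2) -> Poly2:
    """T-compression as the difference of two fitted preparation curves."""
    return (with_compression[0] - without[0],
            with_compression[1] - without[1],
            with_compression[2] - without[2])


def t_get_poly(latency_s: float, seconds_per_mb: float,
               bytes_per_pixel: float, size_factor: float = 1.0) -> Poly2:
    """T-get(n) derived from the linear law t = latency + rate * size_MB.

    The file size of a square n x n block is taken as
    n**2 * bytes_per_pixel * size_factor / 1e6 megabytes, where
    size_factor is the ratio of compressed to uncompressed size.
    """
    return (latency_s, 0.0,
            seconds_per_mb * bytes_per_pixel * size_factor / 1e6)


def t_channel(n: float, base: Poly2, compression: Poly2, get: Poly2) -> float:
    """Sum of the three terms of the model for one channel."""
    return poly_eval(base, n) + poly_eval(compression, n) + poly_eval(get, n)


if __name__ == "__main__":
    base = (0.4, 1e-4, 2e-7)          # hypothetical fit without compression
    compressed = (0.5, 1e-4, 3e-7)    # hypothetical fit with compression
    comp = t_compression_poly(compressed, base)
    get = t_get_poly(latency_s=0.1, seconds_per_mb=0.01, bytes_per_pixel=2.0)
    print(t_channel(1000, base, comp, get))
```

Because the file size grows as n², a linear dependence of transfer time on file size turns into a polynomial in n with no linear term, which is consistent with the second-degree approximation used for T-get.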
Figure 4: Example of dependence of file transfer time on file size.

Figure 5: An example of the dependence of the file transfer rate on its size.

Figure 6: Examples of the obtained dependences of the transmission time of a data block on its size and compression type.

3. Conclusions

The article presents the main results of a comprehensive analysis of the efficiency of a system of dynamic block access to data of ultra-large distributed satellite data archives for processing on a cluster of servers. Based on the analysis of the main factors affecting the efficiency of the mechanism for providing data blocks, a method was developed for quickly estimating the time spent on preparing a given set of initial data. The developed technique can also be used to select the optimal size of the data block and other characteristics of the data provided.

Acknowledgments

The work was performed within the framework of the project “Big data in space research: astrophysics, Solar system, geosphere” (state reg. No. 0024-2019-0014).

References

[1] Loupian E.A., Bourtsev M.A., Proshin A.A., Kobets D.A. Evolution of remote monitoring information systems development concepts // Actual Problems of Remote Sensing of the Earth from Space. 2018. Vol. 15. No. 3. P. 53–66. DOI:10.21046/2070-7401-2018-15-3-53-66.

[2] Euroconsult, Brochure “Satellites to be built & launched by 2026”. 2017. Available at: http://www.euroconsult-ec.com/research/satellites-built-launched-by-2026-brochure.pdf.

[3] Zhu L. et al. A review: Remote sensing sensors. IntechOpen, 2018.

[4] Proshin A.A., Bourtsev M.A., Balashov I.V., Loupian E.A., Radchenko M.V., Sychugov I.G. “IKI-Monitoring” shared use center support and development — Possible solutions // Sovremennye Problemy Distantsionnogo Zondirovaniya Zemli iz Kosmosa. 2020. Vol. 17. No. 6. P. 51–55. DOI:10.21046/2070-7401-2020-17-6-51-55.
[5] Loupian E.A., Proshin A.A., Bourtsev M.A., Kashnitskiy A.V., Balashov I.V., Bartalev S.A., Konstantinova A.M., Kobets D.A., Mazurov A.A., Marchenkov V.V., Matveev A.M., Radchenko M.V., Sychugov I.G., Tolpin V.A., Uvarov I.A. Experience of development and operation of the IKI-Monitoring center for collective use of systems for archiving, processing and analyzing satellite data // Actual Problems of Remote Sensing of the Earth from Space. 2019. Vol. 16. No. 3. P. 151–170. DOI:10.21046/2070-7401-2019-16-3-151-170.

[6] Loupian E.A., Proshin A.A., Bourtsev M.A., Balashov I.V., Bartalev S.A., Efremov V.Yu., Kashnitskiy A.V., Mazurov A.A., Matveev A.M., Sudneva O.A., Sychugov I.G., Tolpin V.A., Uvarov I.A. IKI center for collective use of satellite data archiving, processing and analysis systems aimed at solving the problems of environmental study and monitoring // Actual Problems of Remote Sensing of the Earth from Space. 2015. Vol. 12. No. 5. P. 263–284.

[7] Proshin A.A., Matveev A.M., Kashnitskiy A.V., Bourtsev M.A. Satellite data efficient processing with dynamic block archive access // Sovremennye Problemy Distantsionnogo Zondirovaniya Zemli iz Kosmosa. 2020. Vol. 17. No. 6. P. 56–60. DOI:10.21046/2070-7401-2020-17-6-56-60.

[8] Proshin A.A., Loupian E.A., Balashov I.V., Kashnitskiy A.V., Matveev A.M., Rutkevich B.P. Technology of satellite data dynamic block provision to distributed processing systems // Actual Problems of Remote Sensing of the Earth from Space. 2020. Vol. 17. No. 7. P. 79–93. DOI:10.21046/2070-7401-2020-17-7-79-93.

[9] Proshin A.A., Loupian E.A., Balashov I.V., Kashnitskiy A.V., Bourtsev M.A. Unified satellite data archive management platform for remote monitoring systems development // Actual Problems of Remote Sensing of the Earth from Space. 2016. Vol. 13. No. 3. P. 9–27. DOI:10.21046/2070-7401-2016-13-3-9-27.

[10] Kobets D.A., Matveev A.M., Proshin A.A., Mazurov A.A. Operation control and management of distributed complexes of automatic streaming processing of satellite data // Materials of the Fifth International Scientific and Technical Conference “Actual Problems of Creation of Space Remote Sensing Systems of the Earth”. Electromechanical Matters. VNIIEM Studies, 2018. P. 225–234.

[11] Kobets D.A., Matveev A.M., Mazurov A.A., Proshin A.A. Organization of automated multithreaded processing of satellite information in remote monitoring systems // Actual Problems of Remote Sensing of the Earth from Space. 2015. Vol. 12. No. 1. P. 145–155.