Comprehensive analysis of the efficiency of the
system of dynamic block access to data of ultra-large
distributed remote sensing archives
Andrey A. Proshin¹, Evgeniy A. Loupian¹
¹ Space Research Institute of the Russian Academy of Sciences, Moscow, Russia


Abstract
The article analyzes the efficiency of a system of dynamic block access to data of
ultra-large distributed Earth remote sensing (ERS) archives, implemented using technology developed
at the Space Research Institute. The system under consideration prepares the required set of data
blocks with given characteristics for processing on a server cluster. Based on an analysis of the
most significant factors affecting the performance of this system, a methodology was developed that
makes it possible to estimate the preparation time of the required data set on the existing hardware
infrastructure.

Keywords
Satellite data archives, satellite data processing, big data.


1. Introduction
Many tasks related to the processing and visualization of spatial data are solved most effectively
with the block approach, in which the entire area of interest is divided evenly into tiles and the
data are handled tile by tile. This approach is especially relevant for processing long time series of
heterogeneous Earth remote sensing (ERS) satellite data over large areas in order to obtain regular
information products for analyzing the dynamics of surface characteristics. Such products are now
widely used to solve a broad range of scientific and applied problems related to monitoring the
natural environment and anthropogenic factors. Within this approach, satellite data for the area
of interest are processed in parallel on a set of servers, each of which receives a subset of the
tiles into which the area is divided. The main advantage of this mechanism is that it provides the
required degree of parallelism, which allows the available computing resources to be used most
efficiently. The approach also makes it possible to cache data tiles that can be reused, which can
significantly reduce the load on both computing resources and network connections. Notably, the
block approach is currently used by almost all spatial data visualization systems that provide mass
access to such information.
   The most common implementation of the block approach at present is the use of specially

SDM-2021: All-Russian conference, August 24–27, 2021, Novosibirsk, Russia
andry@iki.rssi.ru (A. A. Proshin)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org


Andrey A. Proshin et al. CEUR Workshop Proceedings                                        222–230


organized archives of satellite data, which store various types of data in a single
spatial partition and projection. However, this option works well only when a relatively
static and homogeneous set of data is involved, for which a single scheme for organizing
the storage and presentation of data can be chosen. For the processing and analysis of data
over large areas with the joint use of information from different observation systems, such
an approach is in many cases difficult to implement and impractical. This is primarily because
the optimal schemes for organizing the storage of data with different resolutions and observation
schemes can differ significantly from each other. It is important to note that data from a large
number of different ERS satellite systems are now used to solve real problems [1, 2, 3], and their
diversity only increases from year to year. As a result, it is almost impossible to develop a
regular partition that would be equally effective for different types of data and different
processing and analysis tasks. At the same time, simultaneously maintaining several archives
containing the same data in different representations leads to very significant overhead for data
storage and preparation.
   An alternative to the described implementation is the dynamic formation of data blocks
in the spatial division and with the characteristics (spatial resolution, projection, set of
channels, etc.) that are optimal for a specific data processing or visualization task.
This makes the main advantages of the block approach available for specific data processing
problems that rely on existing ultra-large satellite data archives. In this case, dynamic
block access can be provided both to data stored in regular partitions and to data whose
storage is not tied to such partitions. To implement this variant, a technology for dynamic
block access to the satellite data archives of the IKI-Monitoring Shared Use Center was
developed [4, 5, 6, 7, 8]. The distributed archives of this center are built on the unified
satellite data archive management platform for the development of remote monitoring systems,
implemented at the Space Research Institute [9].
   Processing large arrays of remote sensing satellite data naturally requires significant
computing resources and places high demands on the network connections used. Therefore,
optimizing the procedures for the dynamic provision of data blocks is one of the key tasks
in implementing this technology. This article presents a comprehensive analysis of the main
factors affecting the performance of a dynamic block access system for distributed satellite
data archives. These factors include the performance of processing servers and data storage
systems, the speed of network connections, and the size and type of the satellite data provided.
The article presents a technique for quickly assessing the resources required to provide a given
set of data blocks with the desired characteristics. When a new type of processing is being
introduced, the technique makes it possible to estimate the time spent on providing data blocks,
estimate the resources required for this task, and work out an optimal solution based on the real
performance of the existing infrastructure.




2. Methodology for estimating the time spent on preparing data blocks
Within the developed technology of dynamic block access to data in satellite
data archives [7, 8], data blocks are prepared in parallel on a cluster of specialized servers
that have direct access to the data files in the archive. Such servers must be installed in
each information node of the distributed archive, where they serve HTTP requests for data from
the corresponding local volumes of the archive. The requests themselves are formed by the
dispatch server, which acts as a smart load balancer. The data block generation procedure is
implemented on the basis of the free software GDAL. During its execution, only the necessary
fragments of the original data files are read, and a GeoTIFF file is produced with the specified
characteristics: size, projection, resolution, set of channels and compression algorithm. The
file is then sent over the network to a server that acts as a disk buffer: it caches the prepared
data blocks and serves as the source of input data for one or more processing runs.
Efficient multithreaded processing of satellite data on large clusters of specialized
servers relies on technologies also developed at the Space Research Institute [10, 11]. To
estimate the total preparation time of the data required for a specific type of processing, it
is necessary to obtain the average execution time of a request for one data block of each
type. From these values, the total time needed to prepare the required dataset on the
available infrastructure can be estimated.
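The last step can be illustrated with a minimal sketch. It assumes that requests are spread evenly over the block preparation servers and that the servers do not slow each other down; both are simplifying assumptions, and the function name and all numbers in the usage note are hypothetical.

```python
def estimate_total_time(block_plan, workers):
    """Estimate the wall-clock time (seconds) to prepare a whole dataset.

    block_plan: iterable of (block_count, mean_seconds_per_block) pairs,
                one pair per block type.
    workers:    number of servers preparing blocks in parallel.

    Assumption (not from the article): the load is balanced ideally and
    the servers do not compete for storage or network bandwidth.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    # Total sequential work is the sum of count * mean time over block types.
    total_work = sum(count * mean_time for count, mean_time in block_plan)
    return total_work / workers
```

For example, `estimate_total_time([(1000, 2.0), (500, 4.0)], workers=8)` gives 500 seconds for 1000 blocks at 2 s each plus 500 blocks at 4 s each on eight servers.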
   The execution time of a request to prepare a block of data depends on many factors, including
the above characteristics of the files being prepared, the type of source data, their
storage organization scheme, the speed of access to specific file servers, and the performance of
network links. Establishing all of these dependencies without a simplified model takes a lot
of time and resources, which makes it impossible to quickly estimate the time required for a
new type of processing. To build such a model, the influence of the most significant factors
was analyzed, and a number of simplified dependencies were established that can be used to
estimate the preparation time for different samples of source data. In particular, the time of
data transmission over the network and the time required for compression do not depend
significantly on the type of source data or the organization of their storage. In addition,
the model expresses the preparation time of data blocks as a function of their size in image
pixels rather than their size in geographic coordinates. It is important to note that the
simplified model described below is approximate, and the estimates for the option finally
chosen can be refined if necessary. The time of data preparation for one channel of the
satellite instrument was chosen as the main quantity to be modeled. This time is estimated
as the sum of several dependencies that can be established experimentally:

    T-channel(𝑛, store-type, result-projection, compression-type) =
        T-base(𝑛, store-type, result-projection)
        + T-compression(𝑛, compression-type)
        + T-get(𝑛, compression-type),




where 𝑛 is the linear size of the resulting data block in pixels, assuming that the blocks are
square; store-type is a set of characteristics of the source files, the key ones being their
projection and the characteristic size of fragments; result-projection is the projection of the
resulting data; and compression-type is the compression algorithm used.
   The first term, T-base, describes the base (main) time of formation of the data block, which
depends on the block size, on the projection and fragmentation of the source data, and on the
projection in which the results are to be obtained. Importantly, the dependence obtained for a
specific type of data can be applied to other types of satellite data with a similar storage
organization. The second (optional) term, T-compression, is the time required to compress the
resulting data block; it depends only on the block size and the compression algorithm
used. The third term, T-get, is the time it takes to transfer the resulting file over the
network; it depends only on the size of the data. The rest of the article briefly presents the
main methods for evaluating each of these dependencies, with examples of the dependencies
obtained.
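The structure of the formula above can be sketched as a table lookup followed by a sum. The sketch assumes that each term is stored as a second-degree polynomial in the block size; the keys (`"utm_scenes"`, `"utm"`, `"geographic"`, `"none"`, `"deflate"`) and all coefficients are hypothetical placeholders, not measured values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quadratic:
    """Time model t(n) = a*n^2 + b*n + c, with n in pixels and t in seconds."""
    a: float = 0.0
    b: float = 0.0
    c: float = 0.0

    def __call__(self, n: float) -> float:
        return self.a * n * n + self.b * n + self.c


# Placeholder coefficients for illustration only (not from the article).
T_BASE = {
    ("utm_scenes", "utm"): Quadratic(1.0e-6, 1.0e-4, 0.5),
    ("utm_scenes", "geographic"): Quadratic(1.5e-6, 1.0e-4, 0.6),
}
T_COMPRESSION = {
    "none": Quadratic(),  # the compression term is optional
    "deflate": Quadratic(2.0e-7, 0.0, 0.0),
}
T_GET = {
    "none": Quadratic(5.0e-7, 0.0, 0.1),
    "deflate": Quadratic(4.0e-7, 0.0, 0.1),
}


def t_channel(n, store_type, result_projection, compression_type):
    """Sum of the three terms of the T-channel formula for one channel."""
    return (
        T_BASE[(store_type, result_projection)](n)
        + T_COMPRESSION[compression_type](n)
        + T_GET[compression_type](n)
    )
```

Keeping the three terms in separate tables mirrors the model: T-base is measured once per storage scheme and projection, while T-compression and T-get are shared across all data types.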

2.1. Experimental derivation of the T-base dependence
To obtain the basic dependence of the preparation time of data blocks on their size in pixels,
this time must be measured for various block sizes. It is important to note that the
preparation time for a data block depends significantly on the speed of access to the particular
file storage system that holds the source data needed to form it. Differences in the performance
of storage systems can be caused by the specifics of their hardware implementation, their
workload and the bandwidth of network communication channels. The formation time of data blocks
may also depend on a number of other factors, such as the geographical coordinates of the
requested area and the degree to which it is filled with data. Therefore, for each block size
it is necessary to obtain the time averaged over all data blocks that must be provided for
processing. In the presented methodology, this time is estimated as the average over a set of
randomly selected data blocks from the desired sample. For each block size, this sample is
generated anew, so that the source data files are not read from cache, which could significantly
distort the results.
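The sampling procedure can be sketched as follows. The callable `measure` is a hypothetical stand-in for issuing one block request and returning its execution time in seconds; the article does not specify such an interface. The only point illustrated is that a new random sample is drawn for every block size.

```python
import random
import statistics


def mean_time_per_size(candidate_blocks, sizes, measure, sample_size, seed=None):
    """Return {block_size: mean preparation time} using a fresh sample per size.

    candidate_blocks: sequence of block descriptors from the desired sample.
    sizes:            block sizes in pixels to be measured.
    measure:          hypothetical callable (block, size) -> seconds.
    sample_size:      number of randomly selected blocks per size.
    """
    rng = random.Random(seed)
    means = {}
    for n in sizes:
        # Draw a new random sample for every size so that repeated reads
        # of the same source files are less likely to be served from cache.
        sample = rng.sample(candidate_blocks, sample_size)
        means[n] = statistics.fmean(measure(block, n) for block in sample)
    return means
```

A fresh draw does not guarantee that two samples never share a block, so with a small pool of candidates some cache effects may remain.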
   Figure 1 shows examples of basic dependencies obtained when forming blocks
from the data of the OLI_TIRS instrument installed on the Landsat series satellites for the
period from 2013 to 2021 and a range of coordinates approximately corresponding to the territory
of Russia. Dependencies are given for the case in which the UTM projection is preserved
and for the case in which the result is produced in the geographic projection. The studies
showed that, given sufficient statistics, the obtained dependences are well approximated by
second-degree polynomials, which makes it possible to accurately estimate the preparation time
for any block size from a small number of measurements at different block sizes in pixels.
The preparation time does not tend to zero as the block size decreases because each launch
of the data block generation procedure incurs overhead. It is important to note that almost
all of the investigated dependencies are significantly affected by the sharp drop in the
efficiency of disk read and write operations as the amount of data they handle decreases.
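A second-degree approximation of this kind can be obtained by ordinary least squares. The sketch below uses only the standard library and solves the 3x3 normal equations directly; it is one possible fitting procedure, and the article does not state which one was actually used.

```python
def fit_quadratic(sizes, times):
    """Least-squares fit of t(n) = a*n^2 + b*n + c; returns (a, b, c)."""
    if len(sizes) != len(times) or len(set(sizes)) < 3:
        raise ValueError("need matching inputs with at least three distinct sizes")
    # Rescale sizes to at most 1 to keep the normal equations well conditioned.
    scale = max(abs(x) for x in sizes)
    u = [x / scale for x in sizes]
    # Normal equations: sum_j (sum_i u_i^(r+j)) * coef_j = sum_i u_i^r * t_i
    m = [[sum(v ** (r + j) for v in u) for j in range(3)] for r in range(3)]
    rhs = [sum((v ** r) * t for v, t in zip(u, times)) for r in range(3)]
    # Gaussian elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for r in range(col + 1, 3):
            factor = m[r][col] / m[col][col]
            for j in range(col, 3):
                m[r][j] -= factor * m[col][j]
            rhs[r] -= factor * rhs[col]
    # Back substitution: coef[k] multiplies u^k.
    coef = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        known = sum(m[r][j] * coef[j] for j in range(r + 1, 3))
        coef[r] = (rhs[r] - known) / m[r][r]
    # Undo the rescaling of the sizes.
    return coef[2] / scale ** 2, coef[1] / scale, coef[0]
```

The constant term `c` of the fit corresponds to the per-launch overhead mentioned above, which is why the fitted curve does not pass through zero for small blocks.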




Figure 1: Examples of derived base dependencies.


Figure 2: Examples of dependencies of fixed data area (million pixels) preparation time on block size.


   To evaluate the efficiency of data generation procedures for different block sizes, one can
obtain the dependence of the preparation time for a fixed data area on the block size in
pixels. An example of such a dependence is shown in Figure 2. It shows that the
specific time decreases as the block size increases, but for large block sizes this
effect practically disappears.
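If the block preparation time is modeled as a second-degree polynomial t(n) = a·n² + b·n + c, the time per fixed data area follows directly, as sketched below. The coefficients in the usage note are hypothetical and serve only to show the shape of the curve.

```python
def time_per_million_pixels(n, a, b, c):
    """Preparation time per one million pixels for square blocks of side n.

    Assumes the block time is t(n) = a*n^2 + b*n + c seconds, so the specific
    time is 1e6 * t(n) / n^2 = 1e6 * (a + b/n + c/n^2).
    """
    if n <= 0:
        raise ValueError("block size must be positive")
    block_time = a * n * n + b * n + c
    return block_time * 1_000_000 / (n * n)
```

With positive `b` and `c`, the terms `b/n` and `c/n^2` shrink as `n` grows, so the specific time falls toward the limit `1e6 * a`. For the placeholder values `a=2e-7, b=1e-4, c=0.3` this gives roughly 5.2 s at n = 256 and 0.24 s at n = 4096, against a limit of 0.2 s.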

2.2. Experimental derivation of the T-compression dependence
The dependence of the compression time on the block size can be obtained by subtracting the
dependence obtained for the preparation procedure without compression from the dependence of
the preparation time for blocks using a given compression algorithm. This approach is
illustrated in Figure 3. It shows curves corresponding to the preparation time of data
without compression and with different compression algorithms supported by the GDAL software, and




Figure 3: Examples of deriving the T-compression dependencies.


curves corresponding to the desired T-compression dependence. For illustration, the fastest
algorithm, LZW, and the slower but more effective DEFLATE algorithm were selected.
The required dependencies are shown in the graph with dashed lines. Note that the resulting
dependences of the block compression time on block size are also well approximated
by second-degree polynomials.
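The subtraction described in this section can be sketched on measured mean times. The input dictionaries map block size in pixels to mean preparation time in seconds; how these means are stored is an assumption made for illustration.

```python
def compression_curve(with_compression, without_compression):
    """Pointwise T-compression estimate for block sizes measured in both runs.

    Both arguments map block size (pixels) to mean preparation time (seconds).
    Sizes present in only one of the two runs are skipped.
    """
    common = sorted(set(with_compression) & set(without_compression))
    return {n: with_compression[n] - without_compression[n] for n in common}
```

The resulting points can then be approximated by a second-degree polynomial in the same way as the base dependence.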
   The degree of data compression achieved by each of these algorithms was also evaluated. When
compressing blocks that are completely filled with data, the compression percentage for the
DEFLATE algorithm is stable at around 23%, whereas with the LZW algorithm the file size even
increases by about 3%. However, the LZW algorithm may still be advisable when a
significant part of the data blocks is not completely filled with data and older versions of the
GDAL software are used, in which the storage of such data is implemented inefficiently.

2.3. Experimental derivation of the T-get dependence
Since the download time of data files over the standard HTTP protocol is practically
independent of their content, one can first establish the dependence of the download time on
file size in megabytes that is typical for the existing infrastructure. An example of such a
dependence, well approximated by a linear function, is shown in Figure 4. Figure 5
shows the corresponding dependence of the file transfer rate on file size. This
rate tends to the maximum for the communication channel used as the file size grows.
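The link between the linear time dependence and the saturating transfer rate can be sketched as follows. The parameter names `latency_s` (intercept of the linear fit) and `seconds_per_mb` (its slope) are introduced here for illustration, and a positive intercept is assumed.

```python
def transfer_time(size_mb, latency_s, seconds_per_mb):
    """Linear model of download time: fixed per-request cost plus size-proportional part."""
    return latency_s + seconds_per_mb * size_mb


def transfer_rate(size_mb, latency_s, seconds_per_mb):
    """Effective rate in MB/s implied by the linear time model."""
    return size_mb / transfer_time(size_mb, latency_s, seconds_per_mb)
```

Under this model the fixed per-request cost dominates for small files, and the rate approaches `1 / seconds_per_mb` as the file size grows, which matches the behavior described for Figure 5.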
   Figure 6 shows examples of the required dependences of the download time of data
blocks on their size in pixels for various compression algorithms. These are approximated by
second-degree polynomials with coefficients calculated from the established
linear dependence.
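One way to calculate such coefficients is sketched below. It assumes that the file size is proportional to the number of pixels, n², that a megabyte is 10^6 bytes, and that file header overhead is negligible; `bytes_per_pixel` and `size_factor` (ratio of the stored file size to the raw pixel data for the chosen compression) are hypothetical parameters, not quantities defined in the article.

```python
def t_get_coefficients(latency_s, seconds_per_mb, bytes_per_pixel, size_factor=1.0):
    """Coefficients (a, b, c) of T-get(n) = a*n^2 + b*n + c in seconds.

    Derivation under the stated assumptions:
        size_mb(n) = n^2 * bytes_per_pixel * size_factor / 1e6
        T-get(n)   = latency_s + seconds_per_mb * size_mb(n)
    so the linear coefficient is zero and only a and c remain.
    """
    a = seconds_per_mb * bytes_per_pixel * size_factor / 1_000_000
    return a, 0.0, latency_s
```

For instance, with `latency_s=0.05`, `seconds_per_mb=0.01` and two bytes per pixel, a 1000 x 1000 block is 2 MB and the model gives 0.05 + 0.02 = 0.07 s. A different `size_factor` for each compression algorithm yields a separate curve per algorithm.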




Figure 4: Example of dependence of file transfer time on file size.


Figure 5: An example of the dependence of the file transfer rate on its size.


Figure 6: Examples of the obtained dependences of the transmission time of a data block on its size
and compression type.




3. Conclusions
The article presents the main results of a comprehensive analysis of the efficiency of a system
of dynamic block access to data of ultra-large distributed satellite data archives for processing
on a cluster of servers. Based on the analysis of the main factors affecting the efficiency of
the mechanism for providing data blocks, a method was developed for quickly estimating
the time spent on preparing a given set of source data. The developed technique can
also be used to select the optimal data block size and other characteristics of the data
provided.


Acknowledgments
The work was performed within the framework of the “Big data in space research: astrophysics,
Solar system, geosphere” project (state reg. No. 0024-2019-0014).


References
 [1] Loupian E.A., Bourtsev M.A., Proshin A.A., Kobets D.A. Evolution of remote monitoring
     information systems development concepts // Actual Problems of Remote Sensing of the
     Earth from Space. 2018. Vol. 15. No. 3. P. 53–66. DOI:10.21046/2070-7401-2018-15-3-53-66.
 [2] Euroconsult, Brochure “Satellites to be built & launched by 2026”. 2017. Available at:
     http://www.euroconsult-ec.com/research/satellites-built-launched-by-2026-brochure.pdf.
 [3] Zhu L. et al. A review: Remote sensing sensors. IntechOpen, 2018.
 [4] Proshin A.A., Bourtsev M.A., Balashov I.V., Loupian E.A., Radchenko M.V., Sychugov I.G.
     “IKI-Monitoring” shared use center support and development — Possible solutions // Sovre-
     mennye Problemy Distantsionnogo Zondirovaniya Zemli iz Kosmosa. 2020. Vol. 17. No. 6.
     P. 51–55. DOI:10.21046/2070-7401-2020-17-6-51-55.
 [5] Loupian E.A., Proshin A.A., Bourtsev M.A., Kashnitskiy A.V., Balashov I.V., Bartalev S.A.,
     Konstantinova A.M., Kobets D.A., Mazurov A.A., Marchenkov V.V., Matveev A.M.,
     Radchenko M.V., Sychugov I.G., Tolpin V.A., Uvarov I.A. Experience of development and
     operation of the IKI-Monitoring center for collective use of systems for archiving, process-
     ing and analyzing satellite data // Actual Problems of Remote Sensing of the Earth from
     Space. 2019. Vol. 16. No. 3. P. 151–170. DOI:10.21046/2070-7401-2019-16-3-151-170.
 [6] Loupian E.A., Proshin A.A., Bourtsev M.A., Balashov I.V., Bartalev S.A., Efremov V.Yu.,
     Kashnitskiy A.V., Mazurov A.A., Matveev A.M., Sudneva O.A., Sychugov I.G., Tolpin V.A.,
     Uvarov I.A. IKI center for collective use of satellite data archiving, processing and analysis
     systems aimed at solving the problems of environmental study and monitoring // Actual
     Problems of Remote Sensing of the Earth from Space. 2015. Vol. 12. No. 5. P. 263–284.
 [7] Proshin A.A., Matveev A.M., Kashnitskiy A.V., Bourtsev M.A. Satellite data efficient pro-
     cessing with dynamic block archive access // Sovremennye Problemy Distantsionnogo
     Zondirovaniya Zemli iz Kosmosa. 2020. Vol. 17. No. 6. P. 56–60. DOI:10.21046/2070-7401-
     2020-17-6-56-60.
 [8] Proshin A.A., Loupian E.A., Balashov I.V., Kashnitskiy A.V., Matveev A.M., Rutkevich B.P.




     Technology of satellite data dynamic block provision to distributed processing systems //
     Actual Problems of Remote Sensing of the Earth from Space. 2020. Vol. 17. No. 7. P. 79–93.
     DOI:10.21046/2070-7401-2020-17-7-79-93.
 [9] Proshin A.A., Loupian E.A., Balashov I.V., Kashnitskiy A.V., Bourtsev M.A. Unified satel-
     lite data archive management platform for remote monitoring systems development //
     Actual Problems of Remote Sensing of the Earth from Space. 2016. Vol. 13. No. 3. P. 9–27.
     DOI:10.21046/2070-7401-2016-13-3-9-27.
[10] Kobets D.A., Matveev A.M., Proshin A.A., Mazurov A.A. Operation control and man-
     agement of distributed complexes of automatic streaming processing of satellite data //
     Materials of the Fifth International Scientific and Technical Conference “Actual Problems
     of Creation of Space Remote Sensing Systems of the Earth”. Electromechanical Matters.
     VNIIEM Studies, 2018. P. 225–234.
[11] Kobets D.A., Matveev A.M., Mazurov A.A., Proshin A.A. Organization of automated
     multithreaded processing of satellite information in remote monitoring systems // Actual
     Problems of Remote Sensing of the Earth from Space. 2015. Vol. 12. No. 1. P. 145–155.

