Methodology for Evaluating the Effectiveness of the System of
Dynamic Block Access to Data of Ultra-Large Distributed
Remote Sensing Archives
Andrey Proshin a, Evgeniy Loupian a and Sergey Bartalev a
a Space Research Institute of the Russian Academy of Sciences (IKI), Moscow, Russia


                 Abstract
                 The article analyzes the efficiency of a system of dynamic block access to data of
                 ultra-large distributed Earth remote sensing (ERS) archives, implemented with technology
                 developed at the Space Research Institute. It considers why this way of providing access
                 to heterogeneous satellite data for joint processing is relevant and gives a general
                 summary of how the dynamic block access system is implemented. The system prepares the
                 required set of data blocks with given characteristics for processing on a server
                 cluster. The main factors influencing the efficiency of data block preparation are
                 analyzed, and the methodology developed by the authors for estimating the time spent on
                 preparing a given dataset on a specific hardware infrastructure is described.

                 Keywords
                 Heterogeneous satellite data, satellite data archives, satellite data processing, big data

1. Introduction
    Providing data as blocks (tiles) of a fixed spatial partition is widely used in solving a variety
of problems of visualizing and processing data by territory. This approach is especially relevant for
processing long time series of heterogeneous Earth remote sensing (ERS) satellite data over large
areas in order to obtain regular information products that allow the dynamics of changes in surface
characteristics to be analyzed. Such products are now widely used to solve a wide range of scientific
and applied problems related to monitoring the natural environment and anthropogenic factors. In this
approach, satellite data for the area of interest are processed in parallel on a set of servers, each
of which is provided with a subset of the tiles into which the area is divided. The main advantage of
this mechanism is the ability to provide the required degree of parallelization, which allows the most
efficient use of the available computing resources. The approach also makes it possible to cache data
tiles effectively when they can be reused, which can significantly reduce the load on both the
computing resources and the network connections used. It should be noted that the block approach is
currently used by almost all spatial data visualization systems that provide mass access to such
information.
    At present, the most common implementation of the block approach is the use of specially
organized archives of satellite data, which store various types of data in a single spatial partition
and projection. However, this option works well only when it is necessary to operate with a relatively
static and homogeneous set of data, for which a single scheme for organizing the storage and
presentation of data can be chosen. For the processing and analysis of data over large areas with the
joint use of information from different observation systems, such an approach is in many cases
difficult to implement and impractical. This is primarily

VI International Conference Information Technologies and High-Performance Computing (ITHPC-2021),
September 14–16, 2021, Khabarovsk, Russia
EMAIL: andry@iki.rssi.ru
ORCID: 0000-0003-1470-647X (A. 1); 0000-0001-5943-0695 (A. 2); 0000-0002-4198-4400 (A. 3)
            ©️ 2021 Copyright for this paper by its authors.
            Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
            CEUR Workshop Proceedings (CEUR-WS.org)


because the optimal schemes for organizing the storage of data with different resolutions and
observation schemes can differ significantly from each other. The data can be tiled in some regular
way for a given projection or split into regular scenes, e.g. by dividing a sensing orbit into equally
sized overlapping fragments. As an illustration, Figure 1 shows examples of the contours of various
types of satellite data. TERRA/AQUA MODIS data, stored as regular granules in the sinusoidal
projection, are shown in red. Data from the Russian KMSS instrument installed onboard the Meteor-M
series satellites are shown in green. Data from the Landsat-8 satellite, stored in the UTM projection,
are shown in orange. This heterogeneity of satellite data significantly complicates the development of
new data processing routines, and the joint (fusion) processing and analysis of data becomes more
complicated still. It is important to note that at present data from a large number of different ERS
satellite systems are used to solve real problems [1-3], and their diversity only increases from year
to year. As a result, it is almost impossible to develop a regular partition that would be equally
effective for different types of data and different processing and analysis tasks. At the same time,
simultaneously maintaining several archives containing the same data in different representations
leads to very significant overhead costs for data storage and preparation.


Figure 1: Contours of different satellite data storage schemes.

    An alternative to the described implementation is the dynamic formation of data blocks in a
spatial division and with characteristics (spatial resolution, projection, set of channels, etc.) that
are optimal for solving a specific data processing or visualization task. This makes it possible to
use the main advantages of the block approach for specific data processing problems on the basis of
existing ultra-large satellite data archives. In this case, dynamic block access can be provided both
to data stored in certain regular partitions and to data whose storage is not tied to such partitions.
To implement this option, a technology for dynamic block access to the satellite data archives of the
IKI-Monitoring Shared Use Center was developed [4-8]. The distributed archives of this center are
built on the unified satellite data archive management platform for remote monitoring systems
development, implemented at the Space Research Institute [9].
    Processing large arrays of remote sensing satellite data naturally requires significant computing
resources and also places high demands on the network connections used. Therefore, optimizing the
procedures for the dynamic provision of data by blocks is one of the key tasks in the implementation
of the above technology. This article is devoted to a comprehensive analysis of the main factors
affecting the performance of a system of dynamic block access to distributed satellite data archives.
These factors include the performance of processing servers and data storage systems, the speed of
network connections, the size and type of the satellite data provided, etc. The article presents a
technique for quickly assessing the resources required to provide a given set of data blocks with the
desired characteristics. At the stage of implementing a new type of processing, it makes it possible
to estimate the time spent on providing data blocks, estimate the resources required for this task,
and work out an optimal solution based on the real performance of the existing infrastructure.

2. Methodology for estimating the time spent on preparing data blocks
   Within the framework of the developed technology of dynamic block access to data in satellite
data archives [7,8], data blocks are prepared in parallel on a cluster of specialized servers that
have direct access to the data files in the archive. Such servers must be installed in each of the
information nodes of the distributed archive, where they serve HTTP requests for data from the
corresponding local volumes of the archive. The requests themselves are formed by a dispatch server,
which acts as a smart load balancer. The data block generation procedure is implemented on the basis
of the free software GDAL. During its execution, only the necessary fragments of the original data
files are read, and a file in the GeoTIFF format is formed with the specified characteristics: size,
projection, resolution, set of channels and compression algorithm. The file is then sent over the
network to a server that acts as a disk buffer: it caches the prepared data blocks and serves as the
source of initial information for one or more processing runs. Efficient multithreaded processing of
satellite data on large clusters of specialized servers relies on technologies also developed at the
Space Research Institute [10,11]. To estimate the total time needed to prepare the data required for a
specific type of processing, it is necessary to obtain the average execution time of a request for one
data block of each type. From these values, an estimate of the total time needed to prepare the
required dataset on the available infrastructure can be obtained.
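
   The article does not give the exact GDAL invocation used to form a block, so the following is only a
minimal sketch of how such a request could be mapped onto the standard gdalwarp utility. The function
name build_block_command and its parameters are hypothetical; the options used (-t_srs, -te, -ts, -of,
-co COMPRESS=...) are regular gdalwarp and GeoTIFF driver options. Selection of the set of channels is
left out of the sketch.

```python
from typing import List, Tuple


def build_block_command(
    src: str,
    dst: str,
    bounds: Tuple[float, float, float, float],
    size_px: int,
    target_srs: str = "EPSG:4326",
    compression: str = "DEFLATE",
) -> List[str]:
    """Build a gdalwarp command line for one square GeoTIFF data block.

    Hypothetical helper: bounds are (xmin, ymin, xmax, ymax) in the target
    projection, size_px is the linear block size in pixels.
    """
    xmin, ymin, xmax, ymax = bounds
    return [
        "gdalwarp",
        "-overwrite",
        "-t_srs", target_srs,                                 # projection of the result
        "-te", str(xmin), str(ymin), str(xmax), str(ymax),    # extent of the block
        "-ts", str(size_px), str(size_px),                    # block size in pixels
        "-of", "GTiff",                                       # GeoTIFF output
        "-co", f"COMPRESS={compression}",                     # NONE, LZW or DEFLATE
        src,
        dst,
    ]


# Example: a 512 x 512 pixel block in geographic coordinates with LZW compression.
cmd = build_block_command(
    "scene.tif", "block.tif", (30.0, 50.0, 31.0, 51.0), 512, compression="LZW"
)
```

   On a machine with GDAL installed, the resulting list could be passed to subprocess.run(cmd,
check=True); the file names above are placeholders.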
   The execution time of a request to prepare a data block depends on many factors, including the
above characteristics of the files being prepared, the type of the initial data, their storage
organization scheme, the speed of access to specific file servers, and the performance of network
links. Establishing all of these dependencies without a simplified model takes a lot of time and
resources, which makes it impossible to quickly estimate the time required for a new type of
processing. To build such a model, the influence of the most significant factors was analyzed, and a
number of simplified dependencies were established that can be used to estimate the preparation time
for different samples of initial data. In particular, the time of data transmission over the network
and the time required for compression do not depend significantly on the type of the source data or on
the organization of their storage. In addition, the model expresses the preparation time of data
blocks as a function of their size in image pixels rather than their size in geographic coordinates.
It is important to note that the simplified model described below is approximate, and the estimates
for a chosen configuration can be refined if necessary. The time of data preparation for one channel
of the satellite instrument was chosen as the main quantity to be modeled. The formula below estimates
this time as the sum of several dependencies that can be established experimentally:
   T-channel(n, store-type, result-projection, compression-type) =
         T-base(n, store-type, result-projection) +
         T-compression(n, compression-type) +
         T-get(n, compression-type)
Where:
     • n is the linear size of the resulting data block in pixels, assuming that the blocks are square,
     • store-type is a set of characteristics of the source files, the key ones being their projection
         and the characteristic size of fragments,
     • result-projection is the projection of the resulting data,
     • compression-type is the compression algorithm used.
   The first term, T-base, describes the base (main) time of formation of the data block, which
depends on the block size, the projection and fragmentation of the source data, and the projection in
which the results are to be obtained. It is essential that the dependence obtained for a specific type
of data can be applied to other types of satellite data with a similar storage organization. The
second (optional) term, T-compression, corresponds to the time required to compress the resulting data
block and depends only on the block size and the compression algorithm used. The third term, T-get, is
the time it takes to transfer the resulting file over the network and depends only on the size of the
data. The following subsections briefly present the main methods for evaluating each of these
dependencies, with examples of the dependencies obtained.
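
   As an illustration of how the three terms combine, the sketch below evaluates T-channel for one
block size. It assumes that each term has already been reduced to a function of n for a fixed
store-type, result-projection and compression-type; the names quadratic and t_channel and all
coefficient values are hypothetical and are not measurements from the article.

```python
from typing import Callable, Optional

# A dependence of time (seconds) on the linear block size n (pixels).
TimeModel = Callable[[float], float]


def quadratic(a: float, b: float, c: float) -> TimeModel:
    """Return the dependence t(n) = a*n**2 + b*n + c."""
    return lambda n: a * n * n + b * n + c


def t_channel(
    n: float,
    t_base: TimeModel,
    t_get: TimeModel,
    t_compression: Optional[TimeModel] = None,
) -> float:
    """Estimated preparation time of one channel of an n x n block.

    The compression term is optional, as in the formula in the text.
    """
    total = t_base(n) + t_get(n)
    if t_compression is not None:
        total += t_compression(n)
    return total


# Hypothetical coefficients, chosen only to make the example concrete.
t_base = quadratic(2.0e-7, 1.0e-4, 0.30)
t_compression = quadratic(5.0e-8, 0.0, 0.0)
t_get = quadratic(1.0e-8, 0.0, 0.05)

with_compression = t_channel(1000, t_base, t_get, t_compression)
without_compression = t_channel(1000, t_base, t_get)
```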

2.1.    Experimental derivation of the T-base dependence
    To obtain the basic dependence of the preparation time of data blocks on their size in pixels,
this time must be measured for various values of the size. The preparation time for a data block
depends significantly on the speed of access to the particular file storage system that holds the
initial data needed to form it. The difference in the performance of different storage systems can be
associated with the peculiarities of their hardware implementation, their workload and the bandwidth
of network communication channels. The time of formation of data blocks may also depend on a number of
other factors, such as the geographical coordinates of the requested area and the degree to which it
is filled with data. Therefore, for each block size it is necessary to obtain the average time over
all data blocks that must be provided for processing. Within the presented methodology, this time is
estimated as the average over a set of randomly selected data blocks belonging to the desired sample.
For each block size this sample is generated anew, which avoids the caching of reads of the original
data files that could significantly distort the results.
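
    The sampling procedure described above can be sketched as follows. The function
mean_time_per_size and the measure callable are hypothetical: measure(block, n) is assumed to issue
one block request of size n and return its execution time in seconds, and candidates is assumed to be
a non-empty list of block locations belonging to the desired sample.

```python
import random
from statistics import mean
from typing import Callable, Dict, Sequence, TypeVar

Block = TypeVar("Block")


def mean_time_per_size(
    candidates: Sequence[Block],
    sizes: Sequence[int],
    measure: Callable[[Block, int], float],
    sample_size: int,
    rng: random.Random,
) -> Dict[int, float]:
    """Average block preparation time for each block size."""
    result: Dict[int, float] = {}
    for n in sizes:
        # Draw a fresh random sample for every block size, so that source
        # files read for one size are unlikely to be served from a cache
        # when the next size is measured.
        sample = rng.sample(list(candidates), min(sample_size, len(candidates)))
        result[n] = mean(measure(block, n) for block in sample)
    return result


# Example with a stand-in timing function instead of real block requests.
times = mean_time_per_size(
    candidates=list(range(100)),
    sizes=[256, 512],
    measure=lambda block, n: 0.001 * n,
    sample_size=10,
    rng=random.Random(0),
)
```

    Passing the random generator explicitly keeps a measurement run reproducible; in real
measurements measure would wrap an HTTP request to the block preparation server.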
    Figure 2 shows examples of base dependencies obtained when forming blocks from the data of the
OLI_TIRS instrument installed on the Landsat series satellites for the period from 2013 to 2021 and a
range of coordinates approximately corresponding to the territory of Russia. Dependencies are given
for the case in which the UTM projection is preserved and for the case in which the result is obtained
in the geographic projection. The studies have shown that, when sufficient statistics are collected,
the obtained dependences are well approximated by second-degree polynomials. This makes it possible to
accurately estimate the preparation time for any block size in pixels from a small number of
measurements at different sizes. The preparation time does not tend to zero as the block size
decreases because each launch of the data block generation procedure carries overhead costs. It is
important to note that the significant drop in the efficiency of disk read and write operations as the
amount of data they handle decreases makes a significant contribution to almost all of the
investigated dependencies.
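
    A second-degree approximation of this kind can be obtained with an ordinary least-squares fit.
The sketch below is one possible pure-Python implementation, not the authors' code; the function name
fit_quadratic is hypothetical and the data in the example are synthetic. Block sizes are rescaled
before solving the normal equations to keep the computation numerically stable.

```python
from typing import Sequence, Tuple


def fit_quadratic(sizes: Sequence[float], times: Sequence[float]) -> Tuple[float, float, float]:
    """Least-squares fit of t = a*n**2 + b*n + c; returns (a, b, c)."""
    if len(sizes) != len(times) or len(sizes) < 3:
        raise ValueError("need at least three (size, time) pairs")
    scale = max(abs(n) for n in sizes)
    if scale == 0:
        raise ValueError("block sizes must not all be zero")
    # Work with u = n / scale so the normal equations stay well conditioned.
    us = [n / scale for n in sizes]
    s = [sum(u ** k for u in us) for k in range(5)]                      # sums of u^k
    r = [sum(t * u ** k for u, t in zip(us, times)) for k in range(3)]   # sums of t*u^k
    # Augmented normal equations for the unknowns (A, B, C) of A*u^2 + B*u + C.
    m = [
        [s[4], s[3], s[2], r[2]],
        [s[3], s[2], s[1], r[1]],
        [s[2], s[1], s[0], r[0]],
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda row: abs(m[row][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("need at least three distinct block sizes")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(3):
            if row != col:
                factor = m[row][col] / m[col][col]
                m[row] = [v - factor * p for v, p in zip(m[row], m[col])]
    big_a, big_b, big_c = (m[i][3] / m[i][i] for i in range(3))
    # Undo the scaling: A*u^2 = (A/scale^2)*n^2 and B*u = (B/scale)*n.
    return big_a / scale ** 2, big_b / scale, big_c


# Synthetic example: averaged times generated from a known polynomial.
sizes = [256, 512, 1024, 2048, 4096]
times = [2.0e-7 * n * n + 1.0e-4 * n + 0.30 for n in sizes]
a, b, c = fit_quadratic(sizes, times)
```

    The constant term c corresponds to the per-request overhead mentioned above. Where NumPy is
available, numpy.polyfit(sizes, times, 2) performs the same kind of fit.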
    To evaluate the efficiency of data generation procedures for different block sizes, the
dependence of the preparation time for a fixed data area on the block size in pixels can be obtained.
An example of such a dependence is shown in Figure 3. It shows that this specific time decreases as
the block size increases, but for large block sizes the effect practically disappears.
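
    The specific time can be computed directly from the per-block time, as in the short sketch below.
It assumes that the fixed data area is one million pixels, as in the caption of Figure 3, and uses a
hypothetical second-degree dependence block_time with made-up coefficients in place of measured data.

```python
def time_per_million_pixels(n: int, block_time_s: float) -> float:
    """Preparation time normalised to one million pixels for n x n blocks."""
    if n <= 0:
        raise ValueError("block size must be positive")
    return block_time_s * 1_000_000 / (n * n)


def block_time(n: int) -> float:
    """Hypothetical per-block dependence t(n) = a*n^2 + b*n + c."""
    return 2.0e-7 * n * n + 1.0e-4 * n + 0.30


specific = {n: time_per_million_pixels(n, block_time(n)) for n in (256, 1024, 4096)}
```

    With a dependence of this form the specific time equals a*10^6 + b*10^6/n + c*10^6/n^2, so it
falls as n grows and levels off near a*10^6, which is consistent with the behaviour described above.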

2.2.    Experimental derivation of the T-compression dependence
   The dependence of the compression time on the block size can be obtained by subtracting the
dependence obtained for the preparation procedure without compression from the dependence of the
preparation time for blocks with a given compression algorithm. This approach is illustrated in
Figure 4, which shows curves for the preparation time of data without compression and with different
compression algorithms supported by GDAL, together with curves for the desired T-compression
dependence. For illustration, the fastest algorithm, LZW, and the slower but more efficient DEFLATE
algorithm were selected. The required dependencies are shown in the graph with dashed lines. Note that
the resulting dependences of the block compression time on block size are also well approximated by
second-degree polynomials.
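
   In terms of measured averages, the subtraction amounts to the following minimal sketch. The
function name compression_time is hypothetical, and the timing values in the example are invented for
illustration only.

```python
from typing import Dict


def compression_time(
    with_compression: Dict[int, float],
    without_compression: Dict[int, float],
) -> Dict[int, float]:
    """T-compression per block size as the difference of two measured dependences.

    Both arguments map the block size in pixels to the average preparation
    time in seconds; only sizes present in both are used.
    """
    common = sorted(set(with_compression) & set(without_compression))
    return {n: with_compression[n] - without_compression[n] for n in common}


# Invented averages (seconds) for three block sizes.
plain = {512: 0.40, 1024: 0.61, 2048: 1.34}
deflate = {512: 0.45, 1024: 0.80, 2048: 2.10}
t_compression = compression_time(deflate, plain)
```

   The resulting points can then be approximated by a second-degree polynomial in the same way as
the base dependence.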


Figure 2: Examples of derived base dependencies


Figure 3: Examples of the dependence of the preparation time for a fixed data area (million pixels) on block size


Figure 4: Examples of deriving the T-compression dependence

  The degree of data compression achieved by each of the above algorithms was also evaluated. For
blocks that are completely filled with data, the compression percentage for the DEFLATE algorithm is
stable at around 23%, while for the LZW algorithm the file size even increases by about 3%. However,
the LZW algorithm may still be advisable when a significant part of the data blocks is not completely
filled with data and earlier versions of GDAL are used, in which the storage of such data is
implemented inefficiently.

2.3.    Experimental derivation of the T-get dependence
   Since the download time of data files over the standard HTTP protocol is practically independent
of their content, the dependence of the download time of files on their size in megabytes, which is
specific to the existing infrastructure, can be established first. The resulting dependence is very
well approximated by a linear function, from which the dependence of the data transfer rate on the
file size can be plotted. An example of this dependence is shown in Figure 5. As the file size
increases, the data transfer rate tends to the maximum bandwidth of the communication channel.
    Figure 6 shows examples of the required dependences of the download time of data blocks on their
size in pixels for various compression algorithms. They are approximated by second-degree polynomials
with coefficients calculated from the established linear dependence.
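
    The step from the linear dependence on file size to a second-degree polynomial in the block size
can be sketched as follows. The sketch rests on several assumptions that are not stated in the
article: the download time is modelled as latency_s + seconds_per_mb * size, a megabyte is taken as
2**20 bytes, the file size is taken as n*n*bytes_per_pixel*size_factor with GeoTIFF header overhead
ignored, and size_factor is the ratio of the compressed to the uncompressed size. All function names
and numeric values are hypothetical, except that a factor of 1.03 mirrors the roughly 3% size increase
reported above for LZW on completely filled blocks.

```python
from typing import Tuple


def download_time(size_mb: float, latency_s: float, seconds_per_mb: float) -> float:
    """Linear model of the download time of a file of size_mb megabytes."""
    return latency_s + seconds_per_mb * size_mb


def transfer_rate(size_mb: float, latency_s: float, seconds_per_mb: float) -> float:
    """Effective rate in MB/s; tends to 1/seconds_per_mb for large files."""
    return size_mb / download_time(size_mb, latency_s, seconds_per_mb)


def t_get_coefficients(
    bytes_per_pixel: int,
    size_factor: float,
    latency_s: float,
    seconds_per_mb: float,
) -> Tuple[float, float, float]:
    """Coefficients (a, b, c) of T-get(n) = a*n**2 + b*n + c for n x n blocks."""
    mb_per_pixel = bytes_per_pixel * size_factor / 2 ** 20
    return seconds_per_mb * mb_per_pixel, 0.0, latency_s


# Hypothetical link: 0.05 s per request plus 0.01 s per megabyte (about 100 MB/s).
a_raw, b_raw, c_raw = t_get_coefficients(2, 1.00, 0.05, 0.01)   # no compression
a_lzw, b_lzw, c_lzw = t_get_coefficients(2, 1.03, 0.05, 0.01)   # LZW, about 3% larger
t_get_1024 = a_raw * 1024 ** 2 + b_raw * 1024 + c_raw
rate_large = transfer_rate(1000.0, 0.05, 0.01)
```

    Under these assumptions the linear term of the polynomial is zero and the constant term is the
per-request latency, which is why the transfer rate approaches the channel bandwidth only for large
files.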


Figure 5: An example of the dependence of the file transfer rate on its size


Figure 6: Examples of the obtained dependences of the transmission time of a data block on its size
and compression type


3. Results
    Based on a detailed analysis of the main factors affecting the performance of the system of
dynamic block access to remote sensing archive data, a methodology was developed for quickly
estimating the time cost of preparing the required set of different types of satellite data on a given
set of hardware. The key elements of the presented methodology are algorithms for the experimental
derivation of a number of basic dependencies. The estimates obtained for one computational thread make
it possible to obtain an approximate data preparation time for a given number of computational nodes.
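
    One simple way to pass from single-thread estimates to a cluster is sketched below. It assumes
ideal linear scaling, that is, that the workers do not slow each other down through shared storage or
network links, so the result should be read as an optimistic estimate. The function name
total_preparation_time, the block type labels and all numbers are hypothetical.

```python
from typing import Dict, Tuple


def total_preparation_time(
    requests: Dict[str, Tuple[int, float]],
    workers: int,
) -> float:
    """Approximate wall-clock time (seconds) to prepare a whole dataset.

    requests maps a block type to (number of blocks, average single-thread
    time per block in seconds); workers is the number of parallel threads.
    """
    if workers <= 0:
        raise ValueError("workers must be positive")
    single_thread = sum(count * avg_time for count, avg_time in requests.values())
    return single_thread / workers


# Hypothetical dataset: two block types with different average request times.
dataset = {
    "landsat_utm": (10_000, 0.60),
    "landsat_geographic": (5_000, 0.90),
}
estimate_16 = total_preparation_time(dataset, workers=16)
```

    Inverting the same relation gives the number of workers needed to finish within a required time
frame: the single-thread total divided by the time available, rounded up.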
   As mentioned above, the mass processing of satellite data requires very significant computing
resources and time. Therefore, even at the stage of planning a new type of processing, it is necessary
to estimate the computational resources needed to carry it out in a given time frame. The presented
method makes it possible to estimate the resources required to prepare a given set of data at the
required speed. The resulting estimates can also be used to select the optimal size and
characteristics of the data blocks.

4. Acknowledgements
   The methodology research and development were performed within the framework of the “Big data in
space research: astrophysics, Solar system, geosphere” project (state reg. No. 0024-2019-0014). The
evaluation and testing stages of the studies were carried out using the resources of the Center for
Shared Use of Scientific Equipment "Center for Processing and Storage of Scientific Data of the Far
Eastern Branch of the Russian Academy of Sciences", funded by the Russian Federation represented by
the Ministry of Science and Higher Education of the Russian Federation under project
No. 075-15-2021-663.

5. References
[1] Loupian E.A., Bourtsev M.A., Proshin A.A., Kobets D.A. Evolution of remote monitoring
    information systems development concepts // Actual Problems of Remote Sensing of the Earth
    from Space. 2018. Vol. 15. No. 3. P. 53-66. DOI: 10.21046/2070-7401-2018-15-3-53-66.
[2] Euroconsult, Brochure «Satellites to be built & launched by 2026», 2017, URL:
    http://www.euroconsult-ec.com/research/satellites-built-launched-by-2026-brochure.pdf.
[3] Zhu L. et al. A Review: Remote Sensing Sensors. – IntechOpen, 2018.
[4] Proshin A.A., Bourtsev M.A., Balashov I.V., Loupian E.A., Radchenko M.V., Sychugov
    I.G. “IKI-Monitoring” shared use center support and development — possible solutions //
    Sovremennye problemy distantsionnogo zondirovaniya Zemli iz kosmosa. 2020. Vol. 17. No. 6.
    P. 51-55. DOI: 10.21046/2070-7401-2020-17-6-51-55.
[5] Loupian E.A., Proshin A.A., Bourtsev M.A., Kashnitskiy A.V., Balashov I.V., Bartalev S.A.,
    Konstantinova A.M., Kobets D.A., Mazurov A.A., Marchenkov V.V., Matveev A.M., Radchenko
    M.V., Sychugov I.G., Tolpin V.A., Uvarov I.A. Experience of development and operation of the
    IKI-Monitoring center for collective use of systems for archiving, processing and analyzing
    satellite data // Actual Problems of Remote Sensing of the Earth from Space. 2019. Vol. 16. No.
    3. P. 151-170. DOI: 10.21046/2070-7401-2019-16-3-151-170.
[6] Loupian E.A., Proshin A.A., Bourtsev M.A., Balashov I.V., Bartalev S.A., Efremov V. Yu.,
    Kashnitskiy A.V., Mazurov A.A., Matveev A.M., Sudneva O.A., Sychugov I.G., Tolpin V.A.,
    Uvarov I.A. IKI center for collective use of satellite data archiving, processing and analysis
    systems aimed at solving the problems of environmental study and monitoring // Actual
    Problems of Remote Sensing of the Earth from Space. 2015. Vol. 12. No. 5. P. 263-284.
[7] Proshin A.A., Matveev A.M., Kashnitskiy A.V., Bourtsev M.A. Satellite data efficient
    processing with dynamic block archive access // Sovremennye problemy distantsionnogo
    zondirovaniya Zemli iz kosmosa. 2020. Vol. 17. No. 6. P. 56-60. DOI:
    10.21046/2070-7401-2020-17-6-56-60.
[8] Proshin A.A., Loupian E.A., Balashov I.V., Kashnitskiy A.V., Matveev A.M., Rutkevich
     B.P. Technology of satellite data dynamic block provision to distributed processing systems //
     Actual Problems of Remote Sensing of the Earth from Space. 2020. Vol. 17. No. 7. P. 79-93.
     DOI: 10.21046/2070-7401-2020-17-7-79-93.
[9] Proshin A.A., Loupian E.A., Balashov I.V., Kashnitskiy A.V., Bourtsev M.A. Unified satellite
     data archive management platform for remote monitoring systems development // Actual
     Problems of Remote Sensing of the Earth from Space. 2016. Vol. 13. No. 3. P. 9-27. DOI:
     10.21046/2070-7401-2016-13-3-9-27
[10] Kobets D.A., Matveev A.M., Proshin A.A., Mazurov A.A. Operation control and management of
     distributed complexes of automatic streaming processing of satellite data // Materials of the fifth
     international scientific and technical conference "Actual problems of creation of space remote
     sensing systems of the Earth". Electromechanical matters. VNIIEM studies, 2018. P. 225-234.
[11] Kobets D.A., Matveev A.M., Mazurov A.A., Proshin A.A. Organization of automated
     multithreaded processing of satellite information in remote monitoring systems // Actual
     Problems of Remote Sensing of the Earth from Space. 2015. Vol. 12. No. 1. P. 145-155.

