Towards the Unified Approach for Obtaining Hydro-
Meteorological and Landscape Characteristics for River
Catchments
Dmitriy Abramov (a), Georgy Ayzel (a, b) and Oleg Nikitin (c)
(a) Hydrological State Institute, Vasilyevsky Island, 2nd line, 23, St. Petersburg, 199004, Russian Federation
(b) Institute for Environmental Sciences and Geography, University of Potsdam, Potsdam 14476, Germany
(c) Computing Center of the Far Eastern Branch of the Russian Academy of Sciences, 65 Kim Yu Chena Ulitsa,
  Khabarovsk, 680000, Russian Federation


                 Abstract
                 A hydrological catchment is a complex system that forms and evolves under the
                 interaction of many processes. In general, these processes are reflected in, and can be
                 represented by, a set of geophysical parameters. Modern numerical hydrological models, both
                 physically based and data-driven, benefit from assimilating such catchment parameters,
                 which allow a closer representation of river runoff formation processes. However, no
                 readily available tool provides the same set of geophysical parameters for any
                 catchment across the globe. To fill this gap, we present featureXtractor, an open,
                 unified, and reproducible set of scripts for obtaining a large set of catchment
                 properties [1]. It interacts with the open database HydroATLAS and aggregates different sets
                 of hydrological, physiographic, climatic, land cover, soil, and anthropogenic parameters;
                 then, it stores them in a user-defined format. Thus, any catchment across the globe can be
                 represented by a consistent set of descriptors, which opens a new way towards large-scale
                 hydrological modeling and applications.

                 Keywords
                 Geophysical parameters extraction, HydroATLAS database, open source

1. Introduction
   Hydrological processes are characterized by high spatio-temporal variability. The transformation of
precipitation into runoff proceeds differently from place to place. There are
two primary sources of these differences: (1) the dominant geophysical parameters that
characterize the hydrological catchment and (2) meteorological forcing. Together, these factors
determine the behavior of hydrological catchments in terms of specific runoff formation patterns and
regions [2, 3, 4].
   Several projects have focused on collecting and aggregating universal sets of
geophysical parameters and meteorological forcing to advance large-scale hydrology and the
development of hydrological models. Among the best known are CAMELS [5],
LAMAH [6], MOPEX [7], and CANOPEX [8]. However, each project covers a specific region and uses a
unique set of input data, and therefore relies on different sets of catchment parameters (as well as
tools to acquire them) and meteorological forcing. This limits the projects’ comparability. Thus, after
almost three decades of research in large-scale hydrological modeling, there is no tool for
obtaining a consistent set of catchment parameters that could be particularly beneficial for the research


VI International Conference Information Technologies and High-Performance Computing (ITHPC-2021),
September 14–16, 2021, Khabarovsk, Russia
EMAIL: dmbrmv96@gmail.com (A. 1); hydrogo@yandex.ru (A. 2); olegioner@gmail.com (A. 3);
ORCID: 0000-0003-2682-8722 (A. 1); 0000-0001-5608-9110 (A. 2); 0000-0001-9139-5180 (A. 3);
            ©️ 2021 Copyright for this paper by its authors.
            Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
            CEUR Workshop Proceedings (CEUR-WS.org)


community. While the need for meteorological data is generally met by reanalysis products, no
comparable source has yet been established for catchment descriptors.
   To fill the gap, we propose to use the HydroATLAS [9] database, the most recent and
up-to-date compilation of various geophysical datasets at different catchment
levels.
   This paper presents the computational workflow and the open-source tool featureXtractor,
which aggregates different sets of geophysical parameters from the HydroATLAS database.
As a case study, we demonstrate an application of the developed tool for deriving catchment
properties for 1018 catchments included in the OpenForecast v2 system [10].

2. Data and Methods

2.1. HydroATLAS
   HydroATLAS was chosen as the source database for several reasons:
   1. The first reason is the diversity of data that the HydroATLAS database offers. It contains 56
       environmental variables that are partitioned into 281 individual attributes and organized into 6
       categories: hydrology; physiography; climate; land cover & use; soils & geology; and
       anthropogenic influences.
   2. The second reason is the global availability of data. HydroATLAS derives the
       hydro-environmental characteristics by aggregating and reformatting original data from
       well-established global digital maps, and by accumulating them along the drainage network from
       headwaters to ocean outlets. The attributes are linked to hierarchically nested sub-basins at
       multiple scales, as well as to individual river reaches, both extracted from the global
       HydroSHEDS database [hydrosheds] at 15 arc-second (~500 m) resolution. The sub-basin and
       river reach information is distributed in two companion datasets: BasinATLAS and
       RiverATLAS. We use the BasinATLAS dataset as the source dataset in this study.
   3. The third reason is the uniformity and consistency of data stored in the companion datasets.
       BasinATLAS stores data in shapefiles in which each individual sub-basin carries the same set of
       hydro-environmental characteristics. This allows us to automate the process of
       data extraction and preprocessing.

   Of the 281 attributes in the original dataset, which span the six categories listed above, we
retain 149. The selection is based on expert screening of the suitability of the available
characteristics for further use in hydrological modelling studies. A table listing the original
BasinATLAS attributes and the expert-selected subset, alongside auxiliary information, is available
on GitHub.

2.2. Research Catchments
    For the test case study, we select catchments of rivers across the Russian Federation with
areas from 50 to 50 000 km². In total, our study includes 1018 catchments from the OpenForecast
system [10]. The boundaries of the catchments are stored in shapefiles, which allows us to
process them programmatically using specialized software libraries. The availability of boundaries
is the sole requirement of the developed computational scripts. Thus, the provided approach for
feature extraction (Figure 1) allows us to prepare a unified and consistent set of catchment
descriptors for any river catchment across the globe that has a digitized boundary.

2.3. Computational workflow
   Figure 1 illustrates the general concepts of the proposed computational workflow for deriving the
unified set of catchment descriptors.

Figure 1: The illustration of the proposed computational workflow

   To perform calculations, we need two inputs: (1) a shapefile with the boundary of the analyzed
catchment and (2) a pre-downloaded BasinATLAS dataset [11]. Then, the workflow is as follows.
    1. Using the Fiona [12] and GeoPandas [13] libraries, we read the shapefiles of the catchment
         boundaries and the shapefiles of the BasinATLAS dataset.
    2. After reading the data, we perform the intersection procedure using the spatial functions
         of the GeoPandas and shapely libraries. We use the sub-basin layer with the
         highest spatial resolution from the BasinATLAS dataset, in which individual sub-basins cover
         approximately 50 sq. km.
    3. The intersection requires one more check. Before assigning sub-basin
         characteristics from BasinATLAS to the target basin, it is necessary to ensure that the
         sub-basin overlaps with it sufficiently. Thus, we calculate the share of the sub-basin area
         that falls within the target basin. If this share is more than 0.2, the
         sub-basin is included in the set characterizing the target basin.
    4. After the intersection procedure, the next step is calculating aggregated values and splitting
         them into separate datasets based on their category. The attributes can be divided
         into two types: qualitative (land cover, lithological classes) and quantitative (air
         temperature, extents of different characteristics). Quantitative attributes are aggregated
         with a weighted mean. Qualitative attributes are aggregated with the spatial
         majority, i.e., the most frequent class among the sub-basins is assigned as the descriptor
         of the whole target catchment.
    5. After aggregation, the results are separated into six sets describing
         hydrology, physiography, climate, land cover & use, soils & geology, and anthropogenic
         characteristics.
    6. To speed up the calculations, the workflow has been parallelized using the
         standard multiprocessing library. With 8 threads (CPU: Intel 10700K), the
         computation took 1 hour for the 1018 analyzed catchments.
    7. After the main calculation procedure, the final results can be saved in any user-defined
         format supported by standard pandas functionality (e.g., .csv, .tsv, .xls).
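
   Steps 3 and 4 can be illustrated with a minimal sketch. It assumes that the area of each
sub-basin and the area of its intersection with the target catchment have already been computed
(for example, with GeoPandas). The paper does not state which weights are used for the weighted
mean, nor whether the majority is taken by count or by area, so weighting by intersection area is
an assumption made here for illustration. All function names, field names, and values are
hypothetical and are not taken from featureXtractor.

```python
from collections import defaultdict

OVERLAP_THRESHOLD = 0.2  # minimum share of a sub-basin inside the target (step 3)


def select_subbasins(subbasins, threshold=OVERLAP_THRESHOLD):
    """Keep sub-basins whose overlap share with the target exceeds the threshold."""
    return [
        sb for sb in subbasins
        if sb["intersection_area"] / sb["area"] > threshold
    ]


def weighted_mean(subbasins, attribute):
    """Aggregate a quantitative attribute; weights are intersection areas (assumed)."""
    total_weight = sum(sb["intersection_area"] for sb in subbasins)
    weighted_sum = sum(sb[attribute] * sb["intersection_area"] for sb in subbasins)
    return weighted_sum / total_weight


def spatial_majority(subbasins, attribute):
    """Aggregate a qualitative attribute: the class covering the largest area wins (assumed)."""
    area_by_class = defaultdict(float)
    for sb in subbasins:
        area_by_class[sb[attribute]] += sb["intersection_area"]
    return max(area_by_class, key=area_by_class.get)


# Hypothetical sub-basins (areas in sq. km); attribute names are illustrative only.
subbasins = [
    {"area": 50.0, "intersection_area": 50.0, "temperature": 2.0, "land_cover": "forest"},
    {"area": 50.0, "intersection_area": 25.0, "temperature": 4.0, "land_cover": "forest"},
    {"area": 50.0, "intersection_area": 25.0, "temperature": 6.0, "land_cover": "tundra"},
    {"area": 50.0, "intersection_area": 5.0, "temperature": 9.0, "land_cover": "tundra"},
]

selected = select_subbasins(subbasins)  # the last sub-basin (share 0.1) is dropped
mean_temperature = weighted_mean(selected, "temperature")
dominant_land_cover = spatial_majority(selected, "land_cover")
```

   In this toy example the fourth sub-basin is excluded because only a tenth of its area lies
inside the target catchment, so it affects neither the mean nor the majority class.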


   The resulting computational script, featureXtractor, is written in the Python programming
language [14] and is entirely based on open and freely distributed software packages: NumPy [15],
pandas [16], GeoPandas [13], and shapely [17]. It is available and ready to use in the GitHub
repository [1] under the MIT license.

3. Results
    The proposed script returns six files, one for each category of the BasinATLAS dataset.
In each file, individual attributes are stored as columns, which simplifies further analysis. The
first column of every file contains the unique basin ID, which serves as the key that links
the files of different categories.
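
    Because every file shares the basin ID in its first column, the category tables can be
recombined with a simple join. The sketch below is only an illustration: it uses in-memory CSV
text and the standard csv module to stay free of dependencies, whereas the tool itself relies on
pandas. The column names (basin_id, mean_elevation, annual_precipitation) and the values are
hypothetical and do not reproduce the actual featureXtractor output.

```python
import csv
import io

# Hypothetical excerpts of two category files; real column names may differ.
physiography_csv = "basin_id,mean_elevation\n1001,250\n1002,840\n"
climate_csv = "basin_id,annual_precipitation\n1002,610\n1001,720\n"


def read_by_basin_id(text):
    """Index the rows of a CSV table by the basin ID stored in its first column."""
    reader = csv.DictReader(io.StringIO(text))
    id_column = reader.fieldnames[0]
    return {row[id_column]: row for row in reader}


def join_categories(*tables):
    """Merge category tables, keeping only basins present in every table."""
    common_ids = set(tables[0]).intersection(*tables[1:])
    merged = {}
    for basin_id in sorted(common_ids):
        combined = {}
        for table in tables:
            combined.update(table[basin_id])
        merged[basin_id] = combined
    return merged


catchments = join_categories(
    read_by_basin_id(physiography_csv),
    read_by_basin_id(climate_csv),
)
```

    The join does not depend on row order, so files written by separate worker processes can be
combined without sorting them first. Values stay as strings here; a real analysis would convert
them to numbers.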
    Figure 2 shows the spatial distribution of the OpenForecast basin dataset and the number of
attributes obtained from the computation.


Figure 2: The distribution of values for groundwater table depth, elevation, forest
cover extent, permafrost cover extent, mean annual temperature, and annual precipitation. The
histograms indicate the number of catchments (out of 1018) in each bin or category.

   The distribution of the analyzed environmental variables (Figure 2) gives a reliable representation
of features’ spatial heterogeneity across the analyzed catchments. Also, the analyzed features
correspond to specific landscapes and geographic regions. All obtained results and the code for their
analysis and visualization are available in the GitHub repository [1].

4. Conclusion and Outlook
   We introduced a universal tool and unified approach for obtaining an extensive, descriptive, and
consistent set of hydro-climatic and landscape characteristics. The presented tool is a state-of-the-art
and readily available Swiss Army knife for obtaining the set of catchment attributes for any river
catchment across the globe. The tool was tested on 1018 river catchments in the Russian
Federation and proved efficient in obtaining the input data that is usually required for large-scale
hydrological studies. The resulting wide range of geophysical characteristics opens new opportunities
to quantitatively explore how the interplay between topography, climate, land cover, soil, and geology
shapes hydrological behavior. The global coverage of the BasinATLAS dataset and the open-source
approach of the presented tool make it possible to test hypotheses about the functioning of
hydrological systems based on a consistent sample of catchment attributes available for any river
catchment across the globe.
    The field of hydrological modeling benefits from the introduced instrument. Modern data-driven
models of runoff formation can assimilate catchment attributes while
optimizing their parameters, which could lead to more reliable results [4]. Also, the vector of
catchment attributes could provide deeper insights into the hydrological processes that underlie
runoff formation mechanisms.
    Last but not least, we argue that research reproducibility benefits a broad range of
specialists. The developed tool makes hard-to-obtain catchment attribute data easily
accessible yet consistent and reliable. In this way, featureXtractor democratizes research in
hydrological modeling by making data preparation, one of its most labor-intensive procedures,
available to a broad community that wants to push forward citizen science.

5. Acknowledgements
   The reported study was funded by RFBR, project number 19-35-60005.
   The studies were carried out using the resources of the Center for Shared Use of Scientific
Equipment "Center for Processing and Storage of Scientific Data of the Far Eastern Branch of the
Russian Academy of Sciences", funded by the Russian Federation represented by the Ministry of
Science and Higher Education of the Russian Federation under project No. 075-15-2021-663.

6. References
[1] Abramov D., Ayzel G., featureXtractor, (2021), GitHub repository, URL:
     https://github.com/dmbrmv/featureXtractor
[2] Glushkov V. G. “Geographic-hydrological method.” Proc. of SHI, No. 57-58 (1933) [in
     Russian].
[3] Grigoriev A.A., Budyko M.I. “On the periodic law of geographic zoning” Reports of the USSR
     Academy of Sciences. 1956. vol. 110. № 1. p. 129–132
[4] Kratzert, F., Klotz, D., Shalev, G., Klambauer, G., Hochreiter, S., and Nearing, G. “Towards Learning Universal,
     Regional, and Local Hydrological Behaviors via Machine Learning Applied to Large-Sample
     Datasets.” Hydrology and Earth System Sciences 23, no. 12 (December 17, 2019): 5089–5110.
     https://doi.org/10.5194/hess-23-5089-2019.
[5] Addor, N., Newman, A. J., Mizukami, N., and Clark, M. P. “The CAMELS Data Set: Catchment
     Attributes and Meteorology for Large-Sample Studies.” Hydrology and Earth System Sciences
     21, no. 10 (October 20, 2017): 5293–5313. https://doi.org/10.5194/hess-21-5293-2017.
[6] Klingler, C., Schulz, K., and Herrnegger, M. “LamaH | LaRge-SamPle DaTa for HYdrology and
     Environmental Sciences for Central Europe.” Earth System Science Data Discussions, March 18,
     2021, 1–46. https://doi.org/10.5194/essd-2021-72.
[7] Duan, Q., J. Schaake, V. Andréassian, S. Franks, G. Goteti, H. V. Gupta, Y. M. Gusev, et al.
     “Model Parameter Estimation Experiment (MOPEX): An Overview of Science Strategy and
     Major Results from the Second and Third Workshops.” Journal of Hydrology, The model
     parameter estimation experiment, 320, no. 1 (March 30, 2006): 3–17.
     https://doi.org/10.1016/j.jhydrol.2005.07.031.
[8] Arsenault, R., Bazile, R., Ouellet Dallaire, C., and Brissette, F. “CANOPEX: A Canadian
     Hydrometeorological Watershed Database.” Hydrological Processes 30, no. 15 (2016): 2734–36.
     https://doi.org/10.1002/hyp.10880.
[9] Linke, S., Lehner, B., Ouellet Dallaire, C., Ariwi, J., Grill, G., Anand, M., Beames, P., et al. “Global
     Hydro-Environmental Sub-Basin and River Reach Characteristics at High Spatial Resolution.”
     Scientific Data 6, no. 1 (December 9, 2019): 283. https://doi.org/10.1038/s41597-019-0300-6.
[10] Ayzel, G. “OpenForecast v2: Development and Benchmarking of the First National-Scale
     Operational Runoff Forecasting System in Russia.” Hydrology 8, no. 1 (March 2021): 3.
     https://doi.org/10.3390/hydrology8010003.


[11] Lehner, B.; Linke, S.; Thieme, M. (2019): HydroATLAS version 1.0. figshare. Dataset.
     https://doi.org/10.6084/m9.figshare.9890531.v1
[12] Fiona is GDAL’s neat and nimble vector API for Python programmers, 2021, URL:
     https://pypi.org/project/Fiona/
[13] Kelsey Jordahl, Joris Van den Bossche, Martin Fleischmann, Jacob Wasserman, James McBride,
     Jeffrey Gerard, François Leblanc. (2020, July 15). geopandas/geopandas: v0.8.1 (Version
     v0.8.1). Zenodo. http://doi.org/10.5281/zenodo.3946761
[14] Python Core Team, Python Programming Language, 2021, URL: https://www.python.org/
[15] Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D.,
     Wieser E., et al. “Array Programming with NumPy.” Nature 585, no. 7825 (September 2020):
     357–62. https://doi.org/10.1038/s41586-020-2649-2.
[16] Reback, J., McKinney, W., Van den Bossche, J., Augspurger, T., Cloud, P., et al.
     pandas-dev/pandas: Pandas 1.0.3. Zenodo, 2020. https://doi.org/10.5281/zenodo.3715232.
[17] Gillies S. and others, toblerity.org, “Shapely: manipulation and analysis of geometric objects”,
     2021, URL: https://github.com/Toblerity/Shapely

