<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>and Landscape Characteristics for Hydro- River</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Dmitriy Abramov</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Georgy Ayzel</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oleg Nikitin</string-name>
          <email>olegioner@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computing Center of the Far Eastern Branch of the Russian Academy of Sciences</institution>
          ,
          <addr-line>65 Kim Yu Chena Ulitsa</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Hydrological State Institute</institution>
          ,
          <addr-line>Vasilyevsky Island, 2nd line, 23, St. Petersburg, 199004, Russian Federation</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Institute for Environmental Sciences and Geography, University of Potsdam</institution>
          ,
          <addr-line>Potsdam 14476</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Khabarovsk</institution>
          ,
          <addr-line>680000, Russian Federation</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>16</lpage>
      <abstract>
        <p>A hydrological catchment is a complex product that forms and evolves through the interaction of many processes. In general, these processes are reflected in, and can be represented by, a set of geophysical parameters. Modern numerical hydrological models, both physically based and data-driven, benefit from assimilating such catchment parameters, which allow a closer representation of river runoff formation processes. However, no readily available tool provides the same set of geophysical parameters for any catchment across the globe. To fill this gap, we present featureXtractor, an open, unified, and reproducible set of scripts for obtaining a large set of catchment properties [1]. It interacts with the open database HydroATLAS and aggregates different sets of hydrological, physiographic, climatic, land cover, soil, and anthropogenic parameters; then, it stores them in a user-defined format. Thus, any catchment across the globe can be represented with a consistent set of descriptors, which opens a new way towards large-scale hydrological modeling and applications.</p>
        <p>Keywords: geophysical parameters extraction, HydroATLAS database, open source</p>
        <p>VI International Conference Information Technologies and High-Performance Computing (ITHPC-2021)</p>
      </abstract>
      <kwd-group>
        <kwd>Characteristics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Hydrological processes are characterized by high spatio-temporal variability. From place to place,
precipitation is transformed into runoff in different ways. These differences have two primary
sources: (1) the dominant geophysical parameters that characterize the hydrological catchment and
(2) meteorological forcing. Together, these factors determine the behavior of hydrological
catchments in terms of specific runoff formation patterns and regions [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4</xref>
        ].
      </p>
      <p>
        Several projects have focused on collecting and aggregating universal sets of geophysical
parameters and meteorological forcing to advance large-scale hydrology and the development of
hydrological models. Among the most well-known are CAMELS [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], LamaH [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], MOPEX [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and CANOPEX [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. However, each project covers a specific region and uses a unique set of input data, and thus
relies on different sets of catchment parameters (as well as tools to acquire them) and meteorological
forcing. All this limits the projects’ comparability. Thus, after almost three decades of research in
large-scale hydrological modeling, there is still no tool for obtaining a consistent set of catchment
parameters, which would be particularly beneficial for the research community. While the problem of
obtaining meteorological data is generally solved by reanalysis data, no comparable source has been
introduced for catchment descriptors yet.
      </p>
      <p>2021 Copyright for this paper by its authors.</p>
      <p>
        To fill the gap, we propose to use the HydroATLAS [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] database, the latest and most up-to-date compilation of various geophysical datasets at
different catchment levels.
      </p>
      <p>
        This paper presents computational workflows and the open-source tool — featureXtractor —
which allows aggregating different sets of geophysical parameters from the HydroATLAS database.
As a case study, we demonstrate an application of the developed tool for deriving catchment
properties for 1018 catchments included in the OpenForecast v2 system [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Data and Methods</title>
    </sec>
    <sec id="sec-3">
      <title>2.1. HydroATLAS</title>
      <p>HydroATLAS was chosen as the source database for three reasons:
1. The diversity of the data it offers. HydroATLAS contains 56 environmental variables, partitioned
into 281 individual attributes and organized into six categories: hydrology; physiography; climate;
land cover &amp; use; soils &amp; geology; and anthropogenic influences.
2. The global availability of the data. HydroATLAS derives the hydro-environmental characteristics
by aggregating and reformatting original data from well-established global digital maps, and by
accumulating them along the drainage network from headwaters to ocean outlets. The attributes are
linked to hierarchically nested sub-basins at multiple scales and to individual river reaches, both
extracted from the global HydroSHEDS database [hydrosheds] at 15 arc-second (~500 m) resolution.
The sub-basin and river reach information is distributed in two companion datasets: BasinATLAS and
RiverATLAS. We use BasinATLAS as the source dataset in what follows.
3. The uniformity and consistency of the data stored in the companion datasets. BasinATLAS stores
data in shapefiles in which every sub-basin carries its hydro-environmental characteristics as
attributes, which allows us to automate data extraction and preprocessing.</p>
      <p>Environmental attributes in the HydroATLAS database are stored in six categories: hydrology;
physiography; climate; land cover &amp; use; soils &amp; geology; and anthropogenic influences. Of the 281
attributes in the original dataset, we retain 149. The reduction follows an expert screening of the
suitability of the available characteristics for hydrological modelling studies. The table with the
original BasinATLAS attributes and the expert-guided subset, alongside auxiliary information, is
available on GitHub.</p>
    </sec>
    <sec id="sec-4">
      <title>2.2. Research Catchments</title>
      <p>
        For the test case study, we select catchments of rivers across the Russian Federation with
areas from 50 to 50 000 km². In total, our study includes 1018 catchments from the OpenForecast
system [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The catchment boundaries are stored in shapefiles, which allows us to process them
programmatically using specialized software libraries. The availability of boundaries is the sole
requirement of the developed computational scripts. Thus, the proposed approach to feature extraction
(Figure 1) yields a unified and consistent set of catchment descriptors for any river catchment across
the globe that has a digitized boundary.
      </p>
    </sec>
    <sec id="sec-5">
      <title>2.3. Computational workflow</title>
      <p>The calculations need two inputs: (1) a shapefile with the boundary of the analyzed catchment and
(2) a pre-downloaded BasinATLAS dataset [11]. The workflow is as follows.
1. Using the Fiona [12] and GeoPandas [13] libraries, we read the boundary shapefiles and the
BasinATLAS shapefiles.
2. We then intersect the two layers using the spatial functions of the GeoPandas and shapely
libraries. We use the BasinATLAS sub-basin layer with the highest spatial resolution, in which
individual sub-basins are approximately 50 sq. km.
3. Before assigning the characteristics of a BasinATLAS sub-basin to the target basin, we ensure
that the sub-basin overlaps it sufficiently. To do so, we calculate the share of the sub-basin area
that falls within the target basin. If this share exceeds 0.2, the sub-basin is included as
characterizing the target basin.
4. Next, we aggregate the attribute values over the selected sub-basins. The attributes fall into two
types: qualitative (land cover, lithological classes) and quantitative (air temperature, extents of
different characteristics). Quantitative attributes are aggregated with a weighted mean. Qualitative
attributes are aggregated by spatial majority, i.e., the most popular class among the sub-basins
becomes the descriptor of the whole target catchment.
5. After aggregation, the results are split into separate sets describing hydrology, physiography,
climate, land cover &amp; use, soils &amp; geology, and anthropogenic characteristics.
6. To speed up the calculations, the workflow is parallelized using the standard multiprocessing
library. With 8 threads (CPU: Intel Core i7-10700K), processing the 1018 analyzed catchments took
1 hour.
7. Finally, the results can be saved in any user-defined format supported by standard pandas
functionality (e.g., .csv, .tsv, .xls).</p>
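      <p>Steps 3 and 4 of the workflow can be illustrated with a minimal, dependency-free sketch. It is not the featureXtractor code: the tool itself relies on GeoPandas and shapely geometries, whereas this sketch uses axis-aligned boxes as stand-in geometries. The function names, the toy data, and the choice of the intersection area as the weight for both the weighted mean and the majority vote are assumptions made for illustration only; the 0.2 threshold is the one stated above.</p>
      <preformat preformat-type="code">
```python
from collections import defaultdict

SHARE_THRESHOLD = 0.2  # minimum intersection share, as stated in step 3


def area(box):
    """Area of an axis-aligned box (xmin, ymin, xmax, ymax); 0 if empty."""
    xmin, ymin, xmax, ymax = box
    return max(0.0, xmax - xmin) * max(0.0, ymax - ymin)


def intersection(a, b):
    """Overlap of two boxes; the result may be an empty (inverted) box."""
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))


def select_sub_basins(target, sub_basins):
    """Map sub-basin ID to intersection area, keeping only sub-basins
    whose share inside the target exceeds SHARE_THRESHOLD."""
    selected = {}
    for basin_id, box in sub_basins.items():
        overlap = area(intersection(target, box))
        if overlap / area(box) > SHARE_THRESHOLD:
            selected[basin_id] = overlap
    return selected


def weighted_mean(values, weights):
    """Mean of a quantitative attribute, weighted by intersection area
    (the weighting scheme is an assumption of this sketch)."""
    total = sum(weights.values())
    return sum(values[k] * w for k, w in weights.items()) / total


def majority_class(classes, weights):
    """Class covering the largest total intersection area
    (area-based vote is an assumption of this sketch)."""
    area_by_class = defaultdict(float)
    for basin_id, w in weights.items():
        area_by_class[classes[basin_id]] += w
    return max(area_by_class, key=area_by_class.get)


# Made-up example: a 10 x 10 target basin and three candidate sub-basins.
target = (0.0, 0.0, 10.0, 10.0)
sub_basins = {
    "a": (0.0, 0.0, 5.0, 5.0),    # fully inside, share 1.0
    "b": (8.0, 0.0, 12.0, 10.0),  # half inside, share 0.5
    "c": (9.0, 9.0, 14.0, 14.0),  # corner only, share 0.04, rejected
}
weights = select_sub_basins(target, sub_basins)

air_temp = {"a": 2.0, "b": 11.0, "c": 30.0}
land_cover = {"a": "forest", "b": "tundra", "c": "urban"}

mean_temp = weighted_mean(air_temp, weights)
dominant_cover = majority_class(land_cover, weights)
```
      </preformat>
      <p>In this toy case, sub-basin "c" is rejected because only 4% of its area lies inside the target, so it contributes neither to the mean temperature nor to the land cover vote. If the most popular class is read as a simple count of sub-basins, the vote can be made unweighted by adding 1 instead of the area.</p>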
      <p>
        The resulting computational script, featureXtractor, is written in the Python programming
language [14] and is entirely based on open and freely distributed software packages: NumPy [15],
pandas [16], GeoPandas [13], and shapely [17]. It is available and ready to use in the GitHub repository [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
under the MIT license.
      </p>
    </sec>
    <sec id="sec-6">
      <title>3. Results</title>
      <p>The proposed script returns six files, one for each category of the BasinATLAS dataset.
Each file stores individual attributes as columns, which simplifies further analysis. The first
column of every file holds the unique basin ID. This ID is the key that links the files of
different categories.</p>
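      <p>As a brief illustration of how the basin ID links the category files, the following sketch joins two tiny in-memory tables on their first column. The column name basin_id, the attribute names, and all values are hypothetical; the real files are produced with pandas, and this sketch uses only the standard csv module to stay dependency-free.</p>
      <preformat preformat-type="code">
```python
import csv
import io


def read_category(handle, id_column="basin_id"):
    """Read one category table into {basin_id: {attribute: value}}."""
    table = {}
    for row in csv.DictReader(handle):
        basin_id = row.pop(id_column)
        table[basin_id] = dict(row)
    return table


def join_categories(tables):
    """Merge per-category tables on the shared basin ID (inner join)."""
    common_ids = set.intersection(*(set(t) for t in tables))
    merged = {}
    for basin_id in sorted(common_ids):
        merged[basin_id] = {}
        for table in tables:
            merged[basin_id].update(table[basin_id])
    return merged


# Two in-memory stand-ins for category files (made-up names and values).
climate_csv = io.StringIO("basin_id,air_temp\n1001,-2.5\n1002,4.0\n")
soils_csv = io.StringIO("basin_id,clay_fraction\n1001,18\n1002,25\n")

joined = join_categories([read_category(climate_csv), read_category(soils_csv)])
```
      </preformat>
      <p>Values stay as strings here because the csv module does not infer types; with pandas, the same join could be expressed as a merge on the ID column.</p>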
      <p>Figure 2 shows the spatial distribution of the OpenForecast basin dataset and the number of
attributes obtained from the computation.</p>
      <p>
        The distribution of the analyzed environmental variables (Figure 2) gives a reliable representation
of features’ spatial heterogeneity across the analyzed catchments. Also, the analyzed features
correspond to specific landscapes and geographic regions. All obtained results and the code for their
analysis and visualization are available in the GitHub repository [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
    </sec>
    <sec id="sec-7">
      <title>4. Conclusion and Outlook</title>
      <p>We introduced a universal tool and a unified approach for obtaining an extensive, descriptive, and
consistent set of hydro-climatic and landscape characteristics. The presented tool is a state-of-the-art
and readily available Swiss Army knife for obtaining catchment attributes for any river catchment
across the globe. It was tested on 1018 river catchments in the Russian Federation and proved
efficient at obtaining the input data usually required for large-scale hydrological studies. The
obtained wide range of geophysical characteristics opens new opportunities to quantitatively explore
how the interplay between topography, climate, land cover, soil, and geology shapes hydrological
behavior. The global coverage of the BasinATLAS dataset and the open-source nature of the presented
tool make it possible to test hypotheses about the functioning of hydrological systems on a
consistent sample of catchment attributes available for any river catchment across the globe.</p>
      <p>
        The field of hydrological modeling benefits from the introduced instrument. Modern data-driven
runoff models could assimilate catchment attributes while optimizing their parameters, which could
lead to more reliable results [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Also, the vector of catchment
attributes could provide deeper insights into hydrological processes that underlie runoff formation
mechanisms.
      </p>
      <p>Last but not least, we stress that research reproducibility benefits a broad range of
specialists. The developed tool makes hard-to-obtain catchment attribute data easily accessible,
consistent, and reliable. In this way, featureXtractor democratizes research in hydrological
modeling by making data preparation, one of its research-intensive procedures, available to a
broad community that wants to push citizen science forward.</p>
    </sec>
    <sec id="sec-8">
      <title>5. Acknowledgements</title>
      <p>The reported study was funded by RFBR, project number 19-35-60005.</p>
      <p>The studies were carried out using the resources of the Center for Shared Use of Scientific
Equipment "Center for Processing and Storage of Scientific Data of the Far Eastern Branch of the
Russian Academy of Sciences", funded by the Russian Federation represented by the Ministry of
Science and Higher Education of the Russian Federation under project No. 075-15-2021-663.</p>
    </sec>
    <sec id="sec-9">
      <title>6. References</title>
      <p>[11] Lehner, B.; Linke, S.; Thieme, M. (2019): HydroATLAS version 1.0. figshare. Dataset. https://doi.org/10.6084/m9.figshare.9890531.v1</p>
      <p>[12] Fiona is GDAL’s neat and nimble vector API for Python programmers, 2021, URL: https://pypi.org/project/Fiona/</p>
      <p>[13] Kelsey Jordahl, Joris Van den Bossche, Martin Fleischmann, Jacob Wasserman, James McBride, Jeffrey Gerard, François Leblanc. (2020, July 15). geopandas/geopandas: v0.8.1 (Version v0.8.1). Zenodo. http://doi.org/10.5281/zenodo.3946761</p>
      <p>[14] Python Core Team, Python Programming Language, 2021, URL: https://www.python.org/</p>
      <p>[15] Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., et al. “Array Programming with NumPy.” Nature 585, no. 7825 (September 2020): 357–62. https://doi.org/10.1038/s41586-020-2649-2.</p>
      <p>[16] Reback, J., McKinney, W., Van den Bossche, J., Augspurger, T., Cloud, P., et al. Pandas-Dev/Pandas: Pandas 1.0.3. Zenodo, 2020. https://doi.org/10.5281/zenodo.3715232.</p>
      <p>[17] Gillies, S. and others, toblerity.org, “Shapely: manipulation and analysis of geometric objects”, 2021, URL: https://github.com/Toblerity/Shapely</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Abramov</surname>
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ayzel</surname>
            <given-names>G.</given-names>
          </string-name>
          , featureXtractor, (
          <year>2021</year>
          ),
          <article-title>GitHub repository</article-title>
          , URL: https://github.com/dmbrmv/featureXtractor
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><surname>Glushkov</surname> <given-names>V. G.</given-names></string-name>
          “<article-title>Geographic-hydrological method</article-title>.”
          <source>Proc. of SHI</source>, No. 57-58 (<year>1933</year>) [in Russian].
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><surname>Grigoriev</surname> <given-names>A. A.</given-names></string-name>,
          <string-name><surname>Budyko</surname> <given-names>M. I.</given-names></string-name>
          “<article-title>On the periodic law of geographic zoning</article-title>.”
          <source>Reports of the USSR Academy of Sciences</source>
          <volume>110</volume>, no. <issue>1</issue> (<year>1956</year>):
          <fpage>129</fpage>-<lpage>132</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><surname>Kratzert</surname> <given-names>F.</given-names></string-name>,
          <string-name><surname>Klotz</surname> <given-names>D.</given-names></string-name>,
          <string-name><surname>Shalev</surname> <given-names>G.</given-names></string-name>,
          <string-name><surname>Klambauer</surname> <given-names>G.</given-names></string-name>,
          <string-name><surname>Hochreiter</surname> <given-names>S.</given-names></string-name>,
          and <string-name><surname>Nearing</surname> <given-names>G.</given-names></string-name>
          “<article-title>Towards Learning Universal, Regional, and Local Hydrological Behaviors via Machine Learning Applied to Large-Sample Datasets</article-title>.”
          <source>Hydrology and Earth System Sciences</source>
          <volume>23</volume>, no. <issue>12</issue> (December 17, <year>2019</year>):
          <fpage>5089</fpage>-<lpage>5110</lpage>. https://doi.org/10.5194/hess-23-5089-2019.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><surname>Addor</surname> <given-names>N.</given-names></string-name>,
          <string-name><surname>Newman</surname> <given-names>A. J.</given-names></string-name>,
          <string-name><surname>Mizukami</surname> <given-names>N.</given-names></string-name>,
          and <string-name><surname>Clark</surname> <given-names>M. P.</given-names></string-name>
          “<article-title>The CAMELS Data Set: Catchment Attributes and Meteorology for Large-Sample Studies</article-title>.”
          <source>Hydrology and Earth System Sciences</source>
          <volume>21</volume>, no. <issue>10</issue> (October 20, <year>2017</year>):
          <fpage>5293</fpage>-<lpage>5313</lpage>. https://doi.org/10.5194/hess-21-5293-2017.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><surname>Klingler</surname> <given-names>C.</given-names></string-name>,
          <string-name><surname>Schulz</surname> <given-names>K.</given-names></string-name>,
          and <string-name><surname>Herrnegger</surname> <given-names>M.</given-names></string-name>
          “<article-title>LamaH | Large-Sample Data for Hydrology and Environmental Sciences for Central Europe</article-title>.”
          <source>Earth System Science Data Discussions</source>, March 18, <year>2021</year>,
          <fpage>1</fpage>-<lpage>46</lpage>. https://doi.org/10.5194/essd-2021-72.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><surname>Duan</surname> <given-names>Q.</given-names></string-name>,
          <string-name><surname>Schaake</surname> <given-names>J.</given-names></string-name>,
          <string-name><surname>Andréassian</surname> <given-names>V.</given-names></string-name>,
          <string-name><surname>Franks</surname> <given-names>S.</given-names></string-name>,
          <string-name><surname>Goteti</surname> <given-names>G.</given-names></string-name>,
          <string-name><surname>Gupta</surname> <given-names>H. V.</given-names></string-name>,
          <string-name><surname>Gusev</surname> <given-names>Y. M.</given-names></string-name>, et al.
          “<article-title>Model Parameter Estimation Experiment (MOPEX): An Overview of Science Strategy and Major Results from the Second and Third Workshops</article-title>.”
          <source>Journal of Hydrology</source>, The model parameter estimation experiment,
          <volume>320</volume>, no. <issue>1</issue> (March 30, <year>2006</year>):
          <fpage>3</fpage>-<lpage>17</lpage>. https://doi.org/10.1016/j.jhydrol.2005.07.031.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><surname>Arsenault</surname> <given-names>R.</given-names></string-name>,
          <string-name><surname>Bazile</surname> <given-names>R.</given-names></string-name>,
          <string-name><surname>Ouellet Dallaire</surname> <given-names>C.</given-names></string-name>,
          and <string-name><surname>Brissette</surname> <given-names>F.</given-names></string-name>
          “<article-title>CANOPEX: A Canadian Hydrometeorological Watershed Database</article-title>.”
          <source>Hydrological Processes</source>
          <volume>30</volume>, no. <issue>15</issue> (<year>2016</year>):
          <fpage>2734</fpage>-<lpage>36</lpage>. https://doi.org/10.1002/hyp.10880.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name><surname>Linke</surname> <given-names>S.</given-names></string-name>,
          <string-name><surname>Lehner</surname> <given-names>B.</given-names></string-name>,
          <string-name><surname>Ouellet Dallaire</surname> <given-names>C.</given-names></string-name>,
          <string-name><surname>Ariwi</surname> <given-names>J.</given-names></string-name>,
          <string-name><surname>Grill</surname> <given-names>G.</given-names></string-name>,
          <string-name><surname>Anand</surname> <given-names>M.</given-names></string-name>,
          <string-name><surname>Beames</surname> <given-names>P.</given-names></string-name>, et al.
          “<article-title>Global Hydro-Environmental Sub-Basin and River Reach Characteristics at High Spatial Resolution</article-title>.”
          <source>Scientific Data</source>
          <volume>6</volume>, no. <issue>1</issue> (December 9, <year>2019</year>):
          <fpage>283</fpage>. https://doi.org/10.1038/s41597-019-0300-6.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name><surname>Ayzel</surname> <given-names>G.</given-names></string-name>
          “<article-title>OpenForecast v2: Development and Benchmarking of the First National-Scale Operational Runoff Forecasting System in Russia</article-title>.”
          <source>Hydrology</source>
          <volume>8</volume>, no. <issue>1</issue> (March <year>2021</year>):
          <fpage>3</fpage>. https://doi.org/10.3390/hydrology8010003.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>