=Paper=
{{Paper
|id=Vol-1752/paper33
|storemode=property
|title=Data Management of the Environmental Monitoring Network: UNECE ICP Vegetation Case
|pdfUrl=https://ceur-ws.org/Vol-1752/paper33.pdf
|volume=Vol-1752
|authors=Gennady Ososkov,Marina Frontasyeva,Alexander Uzhinskiy,Nikolay Kutovskiy,B. Rumyantsev,Andrey Nechaevsky,Sergey Mitsyn,K. Vergel
|dblpUrl=https://dblp.org/rec/conf/rcdl/OsoskovFUKRNMV16
}}
== Data Management of the Environmental Monitoring Network: UNECE ICP Vegetation Case ==
https://ceur-ws.org/Vol-1752/paper33.pdf
Data Management of the Environmental Monitoring Network: UNECE ICP Vegetation Case

G. Ososkov1, M. Frontasyeva2, A. Uzhinskiy1, N. Kutovskiy1, B. Rumyantsev1,2,
A. Nechaevsky1, S. Mitsyn1, K. Vergel2
1 Laboratory of Information Technologies and 2 Frank Laboratory of Neutron Physics
Joint Institute for Nuclear Research, Dubna, Moscow Region, Russia
ososkov@jinr.ru    marina@nf.jinr.ru

Proceedings of the XVIII International Conference «Data Analytics and Management in
Data Intensive Domains» (DAMDID/RCDL’2016), Ershovo, Russia, October 11 - 14, 2016

Abstract

A new data management cloud platform is presented. The platform is to be applied for
global air pollution monitoring to assess the pathways of pollutants in the atmosphere.
For this purpose a set of interconnected services and tools will be developed and hosted
in the JINR cloud.

1 Introduction

    Air pollution has a significant negative impact on the various components of
ecosystems and on human health, and ultimately causes significant economic damage. That
is why air pollution is a main concern of environmental safety doctrines all over the
world. Increased ratification of the Protocols of the Convention on Long-range
Transboundary Air Pollution (LRTAP) is identified as a high priority in the new
long-term strategy of the Convention. Full implementation of air pollution abatement
policies is particularly desirable for countries of Eastern Europe, the Caucasus and
Central Asia (EECCA) and South-Eastern Europe (SEE). The study of atmospheric deposition
of heavy metals, nitrogen, persistent organic pollutants (POPs) and radionuclides is
based on the analysis of naturally growing mosses through moss surveys carried out every
5 years [1]. Owing to the intense activity of the Joint Institute for Nuclear Research
(JINR) as coordinator of the moss surveys since 2014, Azerbaijan, Belarus, Georgia,
Kazakhstan, Moldova, Turkey and Ukraine participated in the 2015/2016 moss survey. The
UNECE ICP Vegetation programme [2] is now carried out in 36 countries of Europe and
Asia. Mosses are collected at thousands of sites across Europe, and their concentrations
of heavy metals (since 1990), nitrogen (since 2005), POPs (pilot study in 2010) and
radionuclides (since 2015) are determined. The goal of the programme is to identify the
main polluted areas, produce regional maps and further develop the understanding of
long-range transboundary pollution [3].

2 Experiment and data interpretation

    Sampling is carried out in compliance with the internationally accepted guidelines
[4]. Analytical techniques such as AAS, AFS, CVAAS, CVAFS, ETAAS, FAAS, GFAAS, ICP-ES
and ICP-MS, as well as INAA, are used for elemental determination. A total of 13 elements
are reported to the Atlas (As, Cd, Cr, Cu, Fe, Hg, Ni, Pb, V, Zn, Al, Sb, and N). POPs
(whichever are determined) and radionuclides (namely 210Pb and 137Cs) are now also
accepted for air pollution characterization. The results are reported as the number of
sampling sites and the minimum, maximum and median concentrations in mg/kg. The data
interpretation is based on multivariate statistical analysis (factor analysis), a
description of the sampling sites (MossMet information package) and distribution maps for
each element produced using ArcMap, part of ArcGIS, an integrated geographical
information system (GIS) [5]. Examples of GIS maps are presented in Fig. 1.

Figure 1 Examples of distribution maps [3]

    Analytical results and information on the sampling sites (MossMet set) reported to
JINR include confidential




acceptance of the data from individual contributors, the storage of large data arrays,
their initial multivariate statistical processing followed by applying GIS technology,
and the use of artificial neural networks for predicting concentrations of chemical
elements in various environments.
    As an example of the importance of this study, the trend of average median metal
concentrations in moss (± one standard deviation) from 1990 to 2010 is presented in
Fig. 2.

Figure 2 Change in atmospheric deposition of elements in time. The black dots in the
graphs show the decline in deposition across Europe and blue dots as modeled by the
Environmental Monitoring and Assessment Program

3 Motivation and aim

    As discussed above, the ICP Vegetation programme is a very important project, but it
has a serious weakness: its weak adoption of modern information technologies. There are
dozens of respondents in the existing monitoring network and their number is increasing,
but the collection and processing of sample information is carried out manually or with
minimal automation. Data are mostly stored in xls files and aggregated manually by the
coordinator. Files from respondents are usually passed to the coordinator by email or by
ordinary mail. There are no common standards for data transfer, storage and processing
software. This situation does not meet modern standards for the quality, effectiveness
and speed of research. The lack of a single web platform that provides a comprehensive
solution to biological monitoring and forecasting tasks is a major problem for research.
    Therefore the aim of the project is to create a cloud platform using modern
analytical, statistical, programmatic and organizational methods to provide the
scientific community with a unified system for gathering, storing, analyzing,
processing, sharing and collective use of biological monitoring data.
    The platform elements are to facilitate the IT aspects of all biological monitoring
stages, starting from the choice of collection sites and sample description parameters
and finishing with the generation of pollution maps of a particular area or a long-term
state-of-environment forecast. Mechanisms and tools for bringing together participants
of heterogeneous biological monitoring networks are to be provided in the platform. That
enables verifying obtained results and optimizes research. The open part of the platform
can be used for informing public authorities, local governments, legal entities and
individuals about state-of-environment changes.
    One more important aspect of ecological research relates to the various statistical
methods applied to process the collected data. Modern approaches to exploring air
pollution by heavy metals, nitrogen, POPs and radionuclides include multivariate
statistical and intelligent data processing as a mandatory part. The latest trends in
data processing extend the set of georeferenced data that is integrated with the
surveyed data. It is no longer limited to geographical, topographical or geological
information, as is traditional in such cases, but also includes, for example, satellite
imagery and its products, high-precision topographic data derived from aerial
photography, etc. These new data classes, in contrast to the traditional ones, are
characterized by high resolution and a dynamic nature: for example, satellite images
represent a reflection of solar radiation, which depends on the time of day, season,
cloud cover, etc. This in turn greatly increases the amount of data to be processed. The
task of integrating different types of data is tied to the development of new models and
algorithms, such as neural networks [6] and self-organizing maps [7], among other things
during the study of dynamic properties of ecological processes.
    So, one more aim of our project is to develop modern software tools for multivariate
statistical and intelligent data processing oriented toward GIS technology.

4 ICP vegetation data and required resources

    The moss data are to be collected for about 50 countries in Europe, Asia and Central
America. Each country has more than 100 monitored points, and several hundred parameters
must be taken into account for each of them. A bulky archive is needed to perform
comparative studies and to estimate the dynamics of the air pollution processes under
study. Keeping in mind the intensive data exchange and the non-relational, poorly
structured character of the data, we estimate the size of our database at the level of
terabytes.
    Thus scientists need to manage large amounts of data, which leads to many
non-trivial IT problems. It seems natural that a solution should be centralized and
outsourced to a cloud.
    From a cloud point of view, the amount of data and computing leads to data-intensive
processing.

5 Data management on the unified cloud platform

    To optimize the whole procedure of data management, it is proposed to build a
unified platform consisting of a set of interconnected services and tools to be
developed, deployed and hosted in the JINR cloud [8].




The JINR cloud currently has 400 CPU cores, 1000 TB of RAM and about 30 TB of total local
disk space on cloud worker nodes for deploying virtual machines and containers. Hosting
services in the cloud allows the cloud resources assigned to the services to be scaled up
and down depending on their load. When a component requires more resources, the cloud
can provide them without affecting other components. This increases the efficiency of
hardware utilization as well as the reliability and availability of the service itself
for the end users. Such auto-scaling behavior will be achieved by using the OneFlow
component of the OpenNebula platform [8], on which the JINR cloud is based.
    We define requirements for the platform and specify its components. The general
architecture of the platform and the technologies used are depicted in Fig. 3.

Figure 3 General architecture of the platform and technologies used

    We analyze data that comes from the contributors. The data samples can have 10 to 40
metrics depending on the collection area. Most of the metrics are optional, so
traditional relational databases would be ineffective. We also want to be able to change
the structure of the data sample object without modifying hard-coded logic, so that new
projects and experiments can be easily integrated into the platform. We have had positive
experience with MongoDB (an open-source document database designed for ease of
development and scaling [9]) in our previous projects, where more than 5 million data
records from 200+ contributors are processed, so it was decided to use this database to
store sampling results.
    The portal back-end will be built on Nginx (an open-source reverse proxy web server
for the HTTP protocol [10]) and developed with PHP (a widely used open-source
general-purpose scripting language that is especially suited for web development). That
should provide the necessary performance and scalability. A web portal with a responsive
design that adjusts to different screen sizes is the main interface of the platform. The
portal allows multilevel access to the data and has advanced data processing and
reporting mechanisms. Currently the basic functionality of the portal has been
implemented, and authorized users can manage their projects/regions, import data samples
and generate regional maps. At the top of Fig. 4 one can see the interface for project
management, where a contributor can add, delete, edit or copy datasets. At the bottom of
the figure, the map with the indication of pollution distributions and basic instruments
to configure the map are presented.
    We have tried QGIS (an open-source geographic information system [11]) and
OpenLayers (an open-source JavaScript library to load, display and render maps from
multiple sources) for regional and global map representation. However, QGIS and its web
plugin are too hard to maintain and develop. We are now using OpenLayers [12] and some
of its specific layers that allow basic interpolation to create concentration maps.
    Another interface to the platform is a RESTful service [13] that we are going to
provide to mobile and desktop applications and also to third-party services that may be
interested in the environmental monitoring data.
    A data import and export mechanism will be available for the platform, so users can
process data online or export it and use their local processing applications.
Intelligent multi-level statistical data processing is one of the important parts of the
platform. We have tried several solutions, but the statistical and analytical packages
are still under discussion. A very promising direction is the use of artificial neural
network applications for predicting concentrations of chemical elements in various
environments. We have done some research in this field but do not yet have a finished
solution.

6 Prediction and GIS-oriented data processing

    Prediction is an important step in the data analysis of any ecological survey.
Applying prediction methods enables the mapping of estimated values. Maps in turn
visualize the spatial variability of the data and can be used for visual analysis so
that ecological hazards can be identified [14].
    Kriging is a widely used interpolation technique for prediction, e.g. of the
concentration of heavy elements in moss [15] or of soil contamination [6]. Recently more
and more research has been directed toward the integration of different data sources,
such as aerial and satellite photography, together with the incorporation of new methods
such as artificial neural networks.
    Mathematically, given a discrete function $f(x_i, y_i)$ (a response variable defined
by measurements over Cartesian coordinates) on an irregular grid of points
$V = \{(x_i, y_i)\}$, an interpolation procedure finds $\tilde{f}(x, y)$ such that
$\tilde{f}(x, y)$ is a prediction of $f$ for all $(x, y) \in \mathbb{R}^2$. Data
integration is done in a way that helps an interpolation procedure, such as an artificial
neural network framework, to build a better predictor (interpolator). This approach is
based on the conjecture that neural networks are capable of exploiting non-linear
correlations hidden in the data.
    Formally, compared to “classic” interpolation, where the predictor variables are
limited to Cartesian coordinates, in this case the set of predictors is expanded with
other predictor variables $g_1(x, y), g_2(x, y), \ldots, g_n(x, y)$. These can be
topographical features, elevation, products of aerial photography and satellite imagery,
and many other surface properties.
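
    As a minimal sketch of what an extended predictor set means in practice, the Python
fragment below predicts a value from the vector (x, y, g1, ..., gn) rather than from
(x, y) alone. Inverse-distance weighting is used here only as a simple stand-in for the
kriging and neural network interpolators discussed in this section; the function name
`idw_predict`, the sample values and the choice of elevation as the single secondary
predictor are illustrative assumptions, not part of the platform.

```python
import math


def idw_predict(samples, query, power=2.0):
    """Inverse-distance-weighted prediction in an extended predictor space.

    samples: list of (predictors, value) pairs, where predictors is a tuple
             (x, y, g1, ..., gn) and value is the measured f(x_i, y_i).
    query:   tuple (x, y, g1, ..., gn) describing the location to predict.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for predictors, value in samples:
        distance = math.dist(predictors, query)
        if distance == 0.0:
            # The query coincides with a sampling site: return the measurement,
            # so the predictor reproduces f on the set V.
            return value
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


# Hypothetical samples: ((x_km, y_km, elevation_km), concentration in mg/kg).
samples = [
    ((0.0, 0.0, 0.1), 2.0),
    ((10.0, 0.0, 0.3), 4.0),
    ((0.0, 10.0, 0.2), 3.0),
    ((10.0, 10.0, 0.8), 8.0),
]

at_site = idw_predict(samples, (10.0, 0.0, 0.3))   # query equal to a sampling site
between = idw_predict(samples, (5.0, 5.0, 0.4))    # query between the sampling sites
```

    In a real application the predictors would have to be scaled to comparable ranges
before distances are computed, since coordinates and secondary variables are measured in
different units.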




Figure 4 Web-portal interfaces

    Thus the interpolation method is replaced in this way: given
$$f\bigl(x_i, y_i, g_1(x_i, y_i), g_2(x_i, y_i), \ldots, g_n(x_i, y_i)\bigr) = f(x_i, y_i) \quad \forall (x_i, y_i) \in V,$$
build a predictor
$$\tilde{f}\bigl(x, y, g_1(x, y), g_2(x, y), \ldots, g_n(x, y)\bigr)$$
and establish an equality
$$\tilde{f}(x, y) = \tilde{f}\bigl(x, y, g_1(x, y), g_2(x, y), \ldots, g_n(x, y)\bigr).$$
While the extended form for $\tilde{f}(x, y)$ seems more complicated, it simply allows an
interpolation method to fuse in possible non-linear correlations and make “better”
predictions, while the other formal parts of the method stay the same.
    A modification of kriging called cokriging has been proposed [16]. It is oriented
toward using these secondary variables, such as aerial and satellite photography, but
has some problems when applied to real-world data. Kriging is oriented toward data that
is normally distributed. While this is somewhat true for the lags of spatial coordinates
that are used in the semivariogram and covariance function, it is not always true for
the secondary predictors $g_i$. Also, it is problematic to construct a covariance
function, as it may naturally be anisotropic and even non-symmetric. An artificial
neural network, on the other hand, automatically adapts to nonlinearities and
non-normally distributed variables.
    This approach also poses some problems that are inherent to artificial neural
networks. As each concrete neural network is a product of a learning




procedure, some form of predictor evaluation has to be incorporated. Usually, several
different learning procedures and network topologies are evaluated, and the results of
the interpolation procedure are analyzed for deficiencies such as overfitting. Such a
superprocedure increases computational costs and may be sped up with parallel computing.
Other problems are caused by data specifics, so some regularization approaches should be
employed, such as learning with Gaussian noise.
    Different types of satellite imagery are currently employed in data processing, such
as LandSat [17] and MODIS (the Moderate-resolution imaging spectroradiometer [18]). The
latter project incorporates two satellites with spectroradiometers (hence the name) that
are able to take satellite imagery with a high spectral resolution of 36 spectral bands.
Although, compared to LandSat, it has moderate resolution (hence the name), it allows
deeper and more thorough analyses of the Earth's surface, thus enabling interesting
possibilities for research into correlation and causality (e.g. contamination spreading
catalysts and accelerants).
    Using the raw spectral radiation bands of spectral imagery is confronted with
obvious interfering factors such as sun azimuth, time of day and season, surface
altitude and slope, and others.
    Thus, in addition to running the standard statistical procedures, which calculate
descriptive statistics and factor analysis, neural network data processing is being
considered for use in the given project, together with various MODIS products such as
surface reflectance, land surface temperature, land cover, vegetation indices, land use,
etc.

7 Conclusion

    The study of the migration and accumulation of highly toxic pollutants, which
include heavy metals, persistent organic pollutants and radionuclides, and of the
influence of pollutants on the various components of natural and urban ecosystems is the
key problem of modern biogeochemistry and ecology. The aim of the given project is to
create a cloud platform using modern analytical, statistical, programmatic and
organizational methods to provide the scientific community with a unified system for
collecting, analyzing and processing biological monitoring data.
    Parts of the project have already been implemented. The rest is going to be
implemented in the next two years.

References

[1] United Nations Economic Commission for Europe International Cooperative Programme on
    Effects of Air Pollution on Natural Vegetation and Crops
    (http://icpvegetation.ceh.ac.uk/)
[2] Harmens H. and Mills G. (Eds.) Air Pollution: Deposition to and impacts on vegetation
    in (South-)East Europe, Caucasus, Central Asia (EECCA/SEE) and South-East Asia.
    Report prepared by ICP Vegetation, March 2014. ICP Vegetation Programme Coordination
    Centre, Centre for Ecology and Hydrology, Bangor, UK. ISBN: 978-1-906698-48-5, 2014,
    72 p.
[3] Harmens H., Norris D.A., Sharps K., Mills G. … Frontasyeva M., et al. Heavy metal and
    nitrogen concentrations in mosses are declining across Europe whilst some “hotspots”
    remain in 2010. Environmental Pollution. 2015, 200: p. 93-104.
    http://dx.doi.org/10.1016/j.envpol.2015.01.036
[4] HEAVY METALS, NITROGEN AND POPs IN EUROPEAN MOSSES: 2015 SURVEY
    http://icpvegetation.ceh.ac.uk/publications/documents/MossmonitoringMANUAL-2015-17.07.14.pdf
[5] Buse A. et al. (2003). Heavy metals in European mosses: 2000/2001 survey. UNECE ICP
    Vegetation Coordination Centre, Centre for Ecology and Hydrology, Bangor, UK.
    http://icpvegetation.ceh.ac.uk.
[6] J. Alijagić, 2013. Application of multivariate statistical methods and artificial
    neural network for separation natural background and influence of mining and
    metallurgy activities on distribution of chemical elements in the Stavnja valley
    (Bosnia and Herzegovina): PhD thesis. University of Nova Gorica.
[7] Žibret, G., Šajn, R., 2010. Hunting for Geochemical Associations of Elements: Factor
    Analysis and Self-Organising Maps. Mathematical Geosciences, 42(6): 681–703,
    doi:10.1007/s11004-010-9288-3.
    http://link.springer.com/article/10.1007/s11004-010-9288-3
[8] Kutovskiy N., Korenkov V., Balashov N., Baranov A., Semenov R. JINR cloud
    infrastructure. Procedia Computer Science, ISSN: 1877-0509, Publisher: Elsevier.
    2015, 66, p. 574-583.
[9] MongoDB site and description, URL: https://www.mongodb.com/
[10] Nginx for Windows URL: http://nginx.org/ru/docs/windows.html
[11] QGIS description URL: http://www.qgis.org/en/site/
[12] OpenLayers description URL: http://docs.openlayers.org/




[13] RESTful Web services: The basics URL:
     https://www.ibm.com/developerworks/library/ws-restful/
[14] Goodchild M.F., Parks B.O., Steyaert L.T. 1993. Environmental modelling with GIS,
     Oxford University Press, New York, 488 p.
[15] S. Nickel et al. / Atmospheric Environment 99 (2014) 85–93
[16] H. Wackernagel. Cokriging versus kriging in regionalized multivariate data
     analysis. Geoderma, 62 (1994) 83-92
[17] LandSat program description URL: http://yceo.yale.edu/what-landsat-program
[18] MODIS spectrometer on TERRA satellite URL: http://modis.gsfc.nasa.gov/about/



