=Paper=
{{Paper
|id=Vol-1752/paper33
|storemode=property
|title=
Data Management of the Environmental Monitoring Network: UNECE ICP Vegetation Case
|pdfUrl=https://ceur-ws.org/Vol-1752/paper33.pdf
|volume=Vol-1752
|authors=Gennady Ososkov,Marina Frontasyeva,Alexander Uzhinskiy,Nikolay Kutovskiy,B. Rumyantsev,Andrey Nechaevsky,Sergey Mitsyn,K. Vergel
|dblpUrl=https://dblp.org/rec/conf/rcdl/OsoskovFUKRNMV16
}}
==
Data Management of the Environmental Monitoring Network: UNECE ICP Vegetation Case
==
Data Management of the Environmental Monitoring
Network: UNECE ICP Vegetation Case
G. Ososkov¹, M. Frontasyeva², A. Uzhinskiy¹, N. Kutovskiy¹, B. Rumyantsev¹˒², A. Nechaevsky¹, S. Mitsyn¹, K. Vergel²
¹Laboratory of Information Technologies and ²Frank Laboratory of Neutron Physics, Joint Institute for Nuclear Research, Dubna, Moscow Region, Russia
ososkov@jinr.ru marina@nf.jinr.ru
Proceedings of the XVIII International Conference «Data Analytics and Management in Data Intensive Domains» (DAMDID/RCDL’2016), Ershovo, Russia, October 11 - 14, 2016
Abstract
A new data management cloud platform is presented. The platform is intended for global air pollution monitoring, in particular for assessing the pathways of pollutants in the atmosphere. For this purpose a set of interconnected services and tools will be developed and hosted in the JINR cloud.
1 Introduction
Air pollution has a significant negative impact on the various components of ecosystems and on human health, and ultimately causes significant economic damage. That is why air pollution is a main concern of environmental safety doctrines all over the world. Increased ratification of the Protocols of the Convention on Long-range Transboundary Air Pollution (LRTAP) is identified as a high priority in the new long-term strategy of the Convention. Full implementation of air pollution abatement policies is particularly desirable for countries of Eastern Europe, the Caucasus and Central Asia (EECCA) and South-Eastern Europe (SEE). The study of atmospheric deposition of heavy metals, nitrogen, persistent organic pollutants (POPs) and radionuclides is based on the analysis of naturally growing mosses through moss surveys carried out every 5 years [1]. Owing to the intense activity of the Joint Institute for Nuclear Research (JINR), the coordinator of the moss surveys since 2014, Azerbaijan, Belarus, Georgia, Kazakhstan, Moldova, Turkey and Ukraine participated in the 2015/2016 moss survey. The UNECE ICP Vegetation programme [2] is now carried out in 36 countries of Europe and Asia. Mosses are collected at thousands of sites across Europe, and their concentrations of heavy metals (since 1990), nitrogen (since 2005), POPs (pilot study in 2010) and radionuclides (since 2015) are determined. The goal of the programme is to identify the main polluted areas, produce regional maps and further develop the understanding of long-range transboundary pollution [3].
2 Experiment and data interpretation
Sampling is carried out in compliance with the internationally accepted guidelines [4]. Analytical techniques such as AAS, AFS, CVAAS, CVAFS, ETAAS, FAAS, GFAAS, ICP-ES and ICP-MS, as well as INAA, are used for elemental determination. A total of 13 elements are reported to the Atlas (As, Cd, Cr, Cu, Fe, Hg, Ni, Pb, V, Zn, Al, Sb, and N). POPs (whichever are determined) and radionuclides (namely, ²¹⁰Pb and ¹³⁷Cs) are now also accepted for air pollution characterization. The results are reported as the number of sampling sites and the minimum, maximum and median concentrations in mg/kg. The data interpretation is based on multivariate statistical analysis (factor analysis), the description of sampling sites (MossMet information package) and distribution maps for each element produced using ArcMap, part of ArcGIS, an integrated geographical information system (GIS) [5]. Examples of GIS maps are presented in Fig. 1.
Figure 1 Examples of distribution maps [3]
Analytical results and information on the sampling sites (MossMet set) reported to JINR include confidential
acceptance of the data from individual contributors, the storage of large data arrays, their initial multivariate statistical processing followed by the application of GIS technology, and the use of artificial neural networks for predicting concentrations of chemical elements in various environments.
As an example of the importance of this study, the trend of the average median metal concentrations in moss (± one standard deviation) from 1990 to 2010 is presented in Fig. 2.
Figure 2 Change in atmospheric deposition of elements in time. The black dots in the graphs show the decline in deposition across Europe, and the blue dots show the values modeled by the Environmental Monitoring and Assessment Program
3 Motivation and aim
As discussed above, the ICP Vegetation programme is a very important project, but it has a serious weakness: its limited adoption of modern information technologies. There are dozens of respondents in the existing monitoring network and their number is increasing, yet the collection and processing of sample information is carried out manually or with minimal automation. Data are mostly stored in xls files and aggregated manually by the coordinator. Files from respondents are usually passed to the coordinator by email or by ordinary mail. There are no common standards for data transfer, storage and processing software. This situation does not meet modern standards for the quality, effectiveness and speed of research. The lack of a single web platform that provides a comprehensive solution for biological monitoring and forecasting tasks is a major problem for research.
Therefore the aim of the project is to create a cloud platform using modern analytical, statistical, programmatic and organizational methods to provide the scientific community with a unified system for gathering, storing, analyzing, processing, sharing and collectively using biological monitoring data.
The platform elements are to facilitate the IT aspects of all biological monitoring stages, starting from the choice of collection places and of the parameters describing the samples, and finishing with the generation of pollution maps of a particular area or a long-term state-of-environment forecast. Mechanisms and tools for bringing together participants of heterogeneous biological monitoring networks are to be provided in the platform. That enables verification of the obtained results and optimizes research. The open part of the platform can be used for informing public authorities, local governments, legal entities and individuals about state-of-environment changes.
One more important aspect of ecological research relates to the various statistical methods applied to process the collected data. Modern approaches to exploring air pollution caused by heavy metals, nitrogen, POPs and radionuclides include multivariate statistical and intelligent data processing as a mandatory part. A recent trend is to extend the set of georeferenced data that is integrated into the processing of survey data. The set is not limited to the geographical, topographical or geological information that is traditional in such cases, but also includes, for example, satellite imagery and its products, high-precision topographic data derived from aerial photography, etc. These new data classes, unlike the traditional ones, are characterized by a high resolution and a dynamic nature: for example, satellite images represent the reflection of solar radiation, which depends on the time of day, season, cloud cover, etc. This in turn greatly increases the amount of data to be processed. The task of integrating different types of data is tied, among other things, to the development of new models and algorithms, such as neural networks [6] and self-organizing maps [7], for the study of the dynamic properties of ecological processes.
So, one more aim of our project is to develop modern software tools for multivariate statistical and intelligent data processing oriented towards GIS technology.
4 ICP vegetation data and required resources
The moss data are to be collected for about 50 countries in Europe, Asia and Central America. Each country has more than 100 monitored points, and several hundred parameters must be taken into account for each of them. A bulky archive is needed to perform comparative studies and to estimate the dynamics of the explored air pollution processes. Keeping in mind the intensive data exchange and the non-relational, poorly structured character of the data, we estimate the size of our database to be on the level of terabytes.
Thus scientists need to manage large amounts of data, which leads to many non-trivial IT problems. It seems natural that a solution should be centralized and outsourced to a cloud.
From the cloud point of view, this amount of data and computing means data-intensive processing.
5 Data management on the unified cloud platform
To optimize the whole procedure of data management, it is proposed to build a unified platform consisting of a set of interconnected services and tools to be developed, deployed and hosted in the JINR cloud [8].
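The weakly structured character of the records noted in Section 4 can be pictured with a small sketch. It assumes a hypothetical record layout (`MossSample` and all of its field names are illustrative, not the platform's schema) and uses invented values; the point is only that the set of measured elements differs from site to site, which is awkward to fit into fixed table columns.

```python
from dataclasses import dataclass, field, asdict
from typing import Dict


@dataclass
class MossSample:
    """One sampling-site record; every field name here is hypothetical."""
    site_id: str
    country: str
    latitude: float
    longitude: float
    survey: str
    # Element symbol -> concentration in mg/kg. Elements that were not
    # determined are simply absent, so records differ in shape.
    concentrations: Dict[str, float] = field(default_factory=dict)
    # Free-form site description; keys may vary between contributors.
    site_info: Dict[str, str] = field(default_factory=dict)


def to_document(sample: MossSample) -> dict:
    """Convert a record to a plain nested dict, e.g. for a document store."""
    return asdict(sample)


# Invented example values, for illustration only.
a = MossSample("RU-001", "Russia", 56.74, 37.19, "2015/2016",
               concentrations={"Pb": 3.1, "Cd": 0.2})
b = MossSample("GE-014", "Georgia", 41.72, 44.79, "2015/2016",
               concentrations={"Pb": 5.4, "V": 2.0, "Zn": 31.0},
               site_info={"land_use": "forest"})

docs = [to_document(a), to_document(b)]
# Union of elements measured at any site.
measured = sorted({el for d in docs for el in d["concentrations"]})
```

In this sketch a new metric is added by writing a new key, without changing the record definition.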
The JINR cloud currently has 400 CPU cores, 1000 TB of RAM and about 30 TB of total local disk space on cloud worker nodes for the deployment of virtual machines and containers. Hosting services in the cloud allows the cloud resources assigned to the services to be scaled up and down depending on their load. When some component requires more resources, the cloud can provide them without affecting other components. This increases the efficiency of hardware utilization as well as the reliability and availability of the service itself for the end users. Such auto-scaling behavior will be achieved by using the OneFlow component of the OpenNebula platform [8], which the JINR cloud is based on.
We define requirements for the platform and specify its components. The general architecture of the platform and the technologies used are depicted in Fig. 3.
Figure 3 General architecture of the platform and technologies used
We analyze data that come from the contributors. The data samples can have 10 to 40 metrics depending on the collecting area. Most of the metrics are optional, so traditional relational databases would be ineffective. We also want to be able to change the structure of the data sample object without modifying hard-coded logic, so that new projects and experiments can be easily integrated into the platform. We have had a positive experience with MongoDB (an open-source document database designed for ease of development and scaling [9]) in our previous projects, where more than 5 million data records from 200+ contributors are processed, so it was decided to use this database to store sampling results.
The portal back-end will be built on Nginx (an open source reverse proxy web server for the HTTP protocol [10]) and developed with PHP (a widely-used open source general-purpose scripting language that is especially suited for web development). That should provide the necessary performance and scalability. A web portal with a responsive design that adjusts to different screen sizes is the main interface of the platform. The portal allows multilevel access to the data and has advanced data processing and reporting mechanisms. Currently the basic functionality of the portal has been implemented, and authorized users can manage their projects/regions, import data samples and generate regional maps. The top of Fig. 4 shows the interface for project management, where a contributor can add, delete, edit or copy datasets. The bottom of the figure shows the map indicating pollution distributions together with basic instruments for configuring the map.
We have tried QGIS (Open Source Geographic Information System [11]) and OpenLayers (an open-source JavaScript library to load, display and render maps from multiple sources) for the representation of regional and global maps. However, QGIS and its web plugin proved too hard to maintain and develop. Now we are using OpenLayers [12] and some of its specific layers that allow basic interpolation to create concentration maps.
Another interface to the platform is a RESTful service [13] that we are going to provide to mobile and desktop applications and also to third-party services that may be interested in the environmental monitoring data.
A data import and export mechanism will be available for the platform, so users can process data online or upload it and use their local processing application. Intelligent multi-level statistical data processing is one of the important parts of the platform. We have tried several solutions, but the statistical and analytical packages are still under discussion. A very promising direction is the use of artificial neural network applications for predicting concentrations of chemical elements in various environments. We have done some research in this field but do not yet have a finished solution.
6 Prediction and GIS-oriented data processing
Prediction is an important step in the data analysis of any ecological survey. The application of prediction methods enables the mapping of estimated values. Maps in their turn visualize the spatial variability of data and can be used for visual analysis so that ecological hazards can be identified [14].
Kriging is a widely-used interpolation technique for prediction, e.g. of the concentration of heavy elements in moss [15] or of soil contamination [6]. Recently more and more research has been directed towards the integration of different data sources, like aerial and satellite photography, together with the incorporation of new methods like artificial neural networks.
Mathematically, given a discrete function $f(x_i, y_i)$ (a response variable defined by measurements over Cartesian coordinates) on an irregular grid of points $V = \{(x_i, y_i)\}$, an interpolation procedure finds $\tilde{f}(x, y)$ such that $\tilde{f}(x, y)$ is a prediction of $f$ for all $(x, y) \in \mathbb{R}^2$. Integration of data is done in such a way that it helps an interpolation procedure, like an artificial neural network framework, to make a better predictor (interpolator). Such an approach is based on the conjecture that neural networks are capable of exploiting non-linear correlations that are hidden in the data.
Formally, compared to “classic” interpolation, where the predictor variables are limited to Cartesian coordinates, in this case the set of predictors is expanded with other predictor variables $g_1(x, y), g_2(x, y), \ldots, g_n(x, y)$. These can be topographical features, elevation, products of aerial photography and satellite imagery, and many other surface properties.
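The expanded predictor set can be illustrated with a minimal sketch. The code below is not the kriging or neural-network procedure discussed in this section: it uses plain inverse-distance weighting as a stand-in interpolator, and the `elevation` covariate, the sample values and all function names are hypothetical. It only shows how the predictor vector $(x, y, g_1(x, y), \ldots, g_n(x, y))$ is assembled and handed to an interpolator.

```python
import math


def augment(x, y, covariates):
    """Return the predictor vector (x, y, g_1(x, y), ..., g_n(x, y))."""
    return (x, y) + tuple(g(x, y) for g in covariates)


def idw_predict(samples, covariates, x, y, power=2.0):
    """Inverse-distance-weighted estimate in the augmented predictor space.

    samples: list of ((x_i, y_i), f_i) measurements.
    This is a simple stand-in for an interpolator, not kriging.
    """
    target = augment(x, y, covariates)
    num = 0.0
    den = 0.0
    for (xi, yi), fi in samples:
        d = math.dist(target, augment(xi, yi, covariates))
        if d == 0.0:
            return fi  # reproduce the measurement at a sampling point
        w = 1.0 / d ** power
        num += w * fi
        den += w
    return num / den


# Hypothetical secondary predictor: a synthetic "elevation" surface.
def elevation(x, y):
    return 0.5 * x + 0.25 * y


# Invented measurements at three sampling points.
samples = [((0.0, 0.0), 1.0), ((1.0, 0.0), 2.0), ((0.0, 1.0), 3.0)]
estimate = idw_predict(samples, [elevation], 0.5, 0.5)
```

In practice the covariates would have to be rescaled to comparable units before distances in the augmented space are meaningful; the sketch omits this step.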
Figure 4 Web-portal interfaces
Thus the interpolation method is replaced in this way: given
$$f\bigl(x_i, y_i, g_1(x_i, y_i), g_2(x_i, y_i), \ldots, g_n(x_i, y_i)\bigr) = f(x_i, y_i) \quad \forall (x_i, y_i) \in V,$$
build a predictor
$$\tilde{f}\bigl(x, y, g_1(x, y), g_2(x, y), \ldots, g_n(x, y)\bigr)$$
and establish an equality
$$\tilde{f}(x, y) = \tilde{f}\bigl(x, y, g_1(x, y), g_2(x, y), \ldots, g_n(x, y)\bigr).$$
While the extended form for $\tilde{f}(x, y)$ seems more complicated, it simply allows an interpolation method to fuse in possible non-linear correlations and make “better” predictions, while the other formal parts of the method stay the same.
A modification of kriging called cokriging has been proposed [16]. It is oriented towards using these secondary variables, such as aerial and satellite photography, but has some problems when applied to real-world data. Kriging is oriented towards data that are normally distributed. While this is somewhat true for the lags of spatial coordinates that are utilized in the semivariogram and covariance function, it is not always true for the secondary predictors $g_i$. Also, it is problematic to construct a covariance function, as it may naturally be anisotropic and even non-symmetric. An artificial neural network, on the other hand, automatically adapts to nonlinearities and non-normally distributed variables.
This approach also poses some problems which are inherent to artificial neural networks. As each concrete neural network is a product of a learning
procedure, some form of predictor evaluation has to be incorporated. Usually, several different learning procedures and network topologies are evaluated, and the results of the interpolation procedure are analyzed for deficiencies like overfitting. Such a superprocedure increases computational costs and may be sped up with parallel computing. Other problems are caused by data specifics, so some regularization approaches should be employed, like learning with Gaussian noise.
Different types of satellite imagery are currently employed in data processing, like LandSat [17] and MODIS (the Moderate-resolution imaging spectroradiometer [18]). The latter project incorporates two satellites with spectroradiometers (hence the name) that are able to take satellite imagery with a high spectral resolution of 36 spectral bands. Although, compared to LandSat, it has a moderate resolution (hence the name), it allows deeper and more thorough analyses of the Earth's surface, thus enabling interesting possibilities for research towards correlation and causality (e.g. contamination spreading catalysts and accelerants).
The use of raw spectral radiation bands of spectral imagery is confronted with obvious interfering factors such as sun azimuth, time of day and season, surface altitude and slope, and others.
Thus, in addition to the standard statistical procedures, which calculate descriptive statistics and perform factor analysis, neural network data processing is being considered for use in the given project, together with various MODIS products, such as surface reflectance, land surface temperature, land cover, vegetation indices, land use, etc.
7 Conclusion
The study of the migration and accumulation of highly toxic pollutants, which include heavy metals, persistent organic pollutants and radionuclides, and of the influence of pollutants on the various components of natural and urban ecosystems is the key problem of modern biogeochemistry and ecology. The aim of the given project is to create a cloud platform using modern analytical, statistical, programmatic and organizational methods to provide the scientific community with a unified system for collecting, analyzing and processing biological monitoring data.
Parts of the project have already been implemented. The rest is going to be implemented in the next two years.
References
[1] United Nations Economic Commission for Europe International Cooperative Programme on Effects of Air Pollution on Natural Vegetation and Crops (http://icpvegetation.ceh.ac.uk/)
[2] Harmens H. and Mills G. (Eds.) Air Pollution: Deposition to and impacts on vegetation in (South-)East Europe, Caucasus, Central Asia (EECCA/SEE) and South-East Asia. Report prepared by ICP Vegetation, March 2014. ICP Vegetation Programme Coordination Centre, Centre for Ecology and Hydrology, Bangor, UK. ISBN: 978-1-906698-48-5, 2014, 72p.
[3] Harmens H., Norris D.A., Sharps K., Mills G. … Frontasyeva M., et al. Heavy metal and nitrogen concentrations in mosses are declining across Europe whilst some “hotspots” remain in 2010. Environmental Pollution. 2015, 200: p. 93-104. http://dx.doi.org/10.1016/j.envpol.2015.01.036
[4] HEAVY METALS, NITROGEN AND POPs IN EUROPEAN MOSSES: 2015 SURVEY http://icpvegetation.ceh.ac.uk/publications/documents/MossmonitoringMANUAL-2015-17.07.14.pdf
[5] Buse A. et al. (2003). Heavy metals in European mosses: 2000/2001 survey. UNECE ICP Vegetation Coordination Centre, Centre for Ecology and Hydrology, Bangor, UK. http://icpvegetation.ceh.ac.uk.
[6] J. Alijagić, 2013. Application of multivariate statistical methods and artificial neural network for separation natural background and influence of mining and metallurgy activities on distribution of chemical elements in the Stavnja valley (Bosnia and Herzegovina): PhD thesis. University of Nova Gorica.
[7] Žibret, G., Šajn, R., 2010. Hunting for Geochemical Associations of Elements: Factor Analysis and Self-Organising Maps. Mathematical Geosciences, 42(6): 681–703, doi:10.1007/s11004-010-9288-3. http://link.springer.com/article/10.1007/s11004-010-9288-3
[8] Kutovskiy N., Korenkov V., Balashov N., Baranov A., Semenov R. JINR cloud infrastructure. Procedia Computer Science, ISSN: 1877-0509, Publisher: Elsevier. 2015, 66, p. 574-583.
[9] MongoDB site and description, URL: https://www.mongodb.com/
[10] Nginx for Windows URL: http://nginx.org/ru/docs/windows.html
[11] QGIS description URL: http://www.qgis.org/en/site/
[12] OpenLayers description URL: http://docs.openlayers.org/
[13] RESTful Web services: The basics URL: https://www.ibm.com/developerworks/library/ws-restful/
[14] Goodchild M.F., Parks B.O., Steyaert L.T. 1993. Environmental modelling with GIS, Oxford University Press, New York, 488 p.
[15] S. Nickel et al. / Atmospheric Environment 99 (2014) 85–93
[16] H. Wackernagel Cokriging versus kriging in regionalized multivariate data analysis. Geoderma, 62 (1994) 83-92
[17] LandSat program description URL: http://yceo.yale.edu/what-landsat-program
[18] MODIS spectrometer on TERRA satellite URL: http://modis.gsfc.nasa.gov/about/