Data Management of the Environmental Monitoring Network: UNECE ICP Vegetation Case

G. Ososkov1, M. Frontasyeva2, A. Uzhinskiy1, N. Kutovskiy1, B. Rumyantsev1,2, A. Nechaevsky1, S. Mitsyn1, K. Vergel2

1 Laboratory of Information Technologies and 2 Frank Laboratory of Neutron Physics, Joint Institute for Nuclear Research, Dubna, Moscow Region, Russia

ososkov@jinr.ru, marina@nf.jinr.ru

Abstract

A new data management cloud platform is presented. The platform is to be applied for global air pollution monitoring purposes to assess the pathway of pollutants in the atmosphere. For this purpose a set of interconnected services and tools will be developed and hosted in the JINR cloud.

1 Introduction

Air pollution has a significant negative impact on the various components of ecosystems and on human health, and ultimately causes significant economic damage. That is why air pollution is a main concern of environmental safety doctrines all over the world. Increased ratification of the Protocols of the Convention on Long-range Transboundary Air Pollution (LRTAP) is identified as a high priority in the new long-term strategy of the Convention. Full implementation of air pollution abatement policies is particularly desirable for countries of Eastern Europe, the Caucasus and Central Asia (EECCA) and South-Eastern Europe (SEE).

The study of atmospheric deposition of heavy metals, nitrogen, persistent organic pollutants (POPs) and radionuclides is based on the analysis of naturally growing mosses through moss surveys carried out every 5 years [1]. Due to the intense activity of the Joint Institute for Nuclear Research (JINR) as a coordinator of the moss surveys since 2014, Azerbaijan, Belarus, Georgia, Kazakhstan, Moldova, Turkey and Ukraine participated in the moss survey for 2015/2016. Nowadays the UNECE ICP Vegetation programme [2] is carried out in 36 countries of Europe and Asia. Mosses are collected at thousands of sites across Europe and their heavy metal (since 1990), nitrogen (since 2005), POPs (pilot study in 2010) and radionuclide (since 2015) concentrations are determined. The goal of this study programme is to identify the main polluted areas, produce regional maps and further develop the understanding of long-range transboundary pollution [3].

2 Experiment and data interpretation

Sampling is carried out in compliance with the internationally accepted guidelines [4]. Analytical techniques such as AAS, AFS, CVAAS, CVAFS, ETAAS, FAAS, GFAAS, ICP-ES and ICP-MS, as well as INAA, are used for elemental determination. A total of 13 elements are reported to the Atlas (As, Cd, Cr, Cu, Fe, Hg, Ni, Pb, V, Zn, Al, Sb, and N). Nowadays POPs (whichever determined) and radionuclides (namely, 210Pb and 137Cs) are also accepted for air pollution characterization. The results are reported as the number of sampling sites and the minimum, maximum and median concentrations in mg/kg. The data interpretation is based on multivariate statistical analysis (factor analysis), descriptions of the sampling sites (MossMet information package) and distribution maps for each element produced using ArcMap, part of ArcGIS, an integrated geographical information system (GIS) [5]. Examples of GIS maps are presented in Fig. 1.
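The per-element summary reported to the Atlas (number of sampling sites, minimum, maximum and median concentration in mg/kg) can be illustrated with a minimal Python sketch. The record layout, the field names `site` and `concentrations`, and the numeric values are hypothetical and are used only for illustration; they are not the MossMet format. Elements missing at a site are simply skipped, since not every element is determined at every site.

```python
from statistics import median


def summarize(samples):
    """Return, per element, the number of sites and the min, max and median
    concentration (mg/kg) over all sites where the element was measured."""
    by_element = {}
    for sample in samples:
        for element, value in sample["concentrations"].items():
            by_element.setdefault(element, []).append(value)
    return {
        element: {
            "sites": len(values),
            "min": min(values),
            "max": max(values),
            "median": median(values),
        }
        for element, values in by_element.items()
    }


# Hypothetical records: site C has no Cd measurement.
samples = [
    {"site": "A", "concentrations": {"Pb": 2.0, "Cd": 0.25}},
    {"site": "B", "concentrations": {"Pb": 4.0, "Cd": 0.75}},
    {"site": "C", "concentrations": {"Pb": 3.0}},
]
summary = summarize(samples)
print(summary["Pb"])
```

Grouping values by element before aggregating keeps the sketch independent of which elements a given contributor reports.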
Proceedings of the XVIII International Conference «Data Analytics and Management in Data Intensive Domains» (DAMDID/RCDL'2016), Ershovo, Russia, October 11-14, 2016

Figure 1 Examples of distribution maps [3]

Analytical results and information on the sampling sites (MossMet set) reported to JINR include confidential acceptance of the data from individual contributors, the storage of large data arrays, their initial multivariate statistical processing followed by applying GIS technology, and the use of artificial neural networks for predicting concentrations of chemical elements in various environments.

As an example of the importance of this study, the trend of average median metal concentrations in moss (± one standard deviation) from 1990 to 2010 is presented in Fig. 2.

Figure 2 Change in atmospheric deposition of elements in time. The black dots in the graphs show the decline in deposition across Europe and the blue dots show the deposition as modeled by the Environmental Monitoring and Assessment Program

3 Motivation and aim

As discussed above, the ICP Vegetation programme is a very important project, but it has a serious weakness: its poor adoption of modern information technologies. There are dozens of respondents in the existing monitoring network and their number is increasing, but information on the collection and processing of samples is handled manually or with minimal automation. Data are mostly stored in xls files and aggregated manually by the coordinator. Files from respondents are usually passed to the coordinator by email or by ordinary mail. There are no common standards for data transfer, storage and processing software. Such a situation does not meet modern standards for quality, effectiveness and speed of research. The lack of a single web platform that provides a comprehensive solution to biological monitoring and forecasting tasks is a major problem for research.

Therefore the aim of the project is to create a cloud platform using modern analytical, statistical, programmatic and organizational methods to provide the scientific community with a unified system for gathering, storing, analyzing, processing, sharing and collective usage of biological monitoring data.

The platform elements are to facilitate the IT aspects of all biological monitoring stages, starting from the choice of collection places and sample description parameters and finishing with the generation of pollution maps of a particular area or a long-term state-of-environment forecast. Mechanisms and tools for associating participants of heterogeneous biological monitoring networks are to be provided in the platform. That enables verification of obtained results and optimizes research. The open part of the platform can be used for informing public authorities, local governments, legal entities and individuals about state-of-environment changes.

One more important aspect of ecological research relates to the various statistical methods applied to process collected data. Modern approaches to exploring air pollution by heavy metals, nitrogen, POPs and radionuclides include multivariable statistical and intellectual data processing as a mandatory part. The latest tendencies in data processing include extending the set of georeferenced data that is integrated into the processing of survey data. This set is no longer limited to the geographical, topographical or geological information traditional in such cases, but also includes, for example, satellite imagery and its products, high-precision topographic data derived from aerial photography, etc. These new data classes, contrary to the traditional ones, are characterized by high resolution and a dynamic nature. For example, a satellite image represents reflected solar radiation, which depends on the time of day, season, cloud cover, etc. This in turn greatly increases the amount of data to be processed. The task of integrating different types of data is tied to the problem of developing new models and algorithms, such as neural networks [6] and self-organizing maps [7], in particular for studying the dynamic properties of ecological processes.

So, one more aim of our project is to develop modern software tools for multivariable statistical and intellectual data processing oriented toward GIS technology.

4 ICP vegetation data and required resources

The moss data are to be collected for about 50 countries in Europe, Asia and Central America. Each country has more than 100 monitored points, and several hundred parameters must be taken into account for each of them. A bulky archive is needed to perform comparative studies and to estimate the dynamics of the explored air pollution processes. Keeping in mind the intensive data exchange and the non-relational and poorly structured character of the data, we can estimate the size of our database at the level of terabytes.

Thus scientists need to manage large amounts of data, which leads to many non-trivial problems in the IT field. It seems natural that a solution should be centralized and outsourced to a cloud.

From the cloud point of view, the amount of data and computing leads to data-intensive processing.

5 Data management on the unified cloud platform

To optimize the whole procedure of data management, it is proposed to build a unified platform consisting of a set of interconnected services and tools to be developed, deployed and hosted in the JINR cloud [8].

The JINR cloud currently has 400 CPU cores, 1000 TB of RAM and about 30 TB of total local disk space on cloud worker nodes for deployment of virtual machines and containers. Hosting services in the cloud allows the cloud resources assigned to the services to be scaled up and down depending on their load. When some component requires more resources, the cloud can provide them without affecting other components. This increases the efficiency of hardware utilization as well as the reliability and availability of the service itself for the end users. Such auto-scaling behavior will be achieved by using the OneFlow component of the OpenNebula platform [8], on which the JINR cloud is based.

We define requirements for the platform and specify its components. The general architecture of the platform and the technologies used are depicted in Fig. 3.

Figure 3 General architecture of the platform and technologies used

We analyze data that come from the contributors. The data samples can have 10 to 40 metrics depending on the collecting area. Most of the metrics are optional, so traditional relational databases would be ineffective. We also want to be able to change the structure of the data sample object without modifying hard-coded logic, so that new projects and experiments can be integrated into the platform easily. We have had positive experience with MongoDB (an open-source document database designed for ease of development and scaling [9]) in our previous projects, where more than 5 million data records from 200+ contributors are processed, so it was decided to use this database to store sampling results.

The portal back-end will be built on Nginx (an open source reverse proxy web server for the HTTP protocol [10]) and developed with PHP (a widely used open source general-purpose scripting language that is especially suited for web development). That should provide the necessary performance and scalability. A web portal with a responsive design that adjusts to different screen sizes is the main interface of the platform. The portal allows multilevel access to the data and has advanced data processing and reporting mechanisms. Currently the basic functionality of the portal has been implemented, and authorized users can manage their projects/regions, import data samples and generate regional maps. At the top of Fig. 4 one can see the interface for project management, where a contributor can add, delete, edit or copy datasets. At the bottom of the figure, the map with the indication of pollution distributions and the basic instruments to configure the map are presented.

Figure 4 Web-portal interfaces

We have tried QGIS (an open source geographic information system [11]) and OpenLayers (an open source JavaScript library to load, display and render maps from multiple sources) for the representation of regional and global maps. QGIS and its web plugin proved too hard to maintain and develop. Now we are using OpenLayers [12] and some of its specific layers that allow basic interpolation to create concentration maps.

Another interface to the platform is a RESTful service [13] that we are going to provide to the mobile and desktop applications and also to third-party services that may be interested in the environmental monitoring data.

A data import and export mechanism will be available for the platform, so users can process data online or download it and use their local processing applications.

Intelligent multi-level statistical data processing is one of the important parts of the platform. We have tried several solutions, but the statistical and analytical packages are still under discussion. A very promising direction is the use of artificial neural network applications for predicting concentrations of chemical elements in various environments. We have done some research in this field but do not yet have a finished solution.

6 Prediction and GIS-oriented data processing

Prediction is an important step in the data analysis of any ecological survey. Application of prediction methods enables mapping of estimated values. Maps in their turn provide visualization of the spatial variability of data and can be used for visual analysis so that ecological hazards can be identified [14].

Kriging is a widely used interpolation technique for prediction, e.g. of the concentration of heavy elements in moss [15] or of soil contamination [6]. Recently more and more research has been directed toward the integration of different data sources, such as aerial and satellite photography, together with the incorporation of new methods such as artificial neural networks.

Mathematically, given a discrete function $f(x_i, y_i)$ (a response variable defined by measurements over Cartesian coordinates) on an irregular grid of points $V = \{(x_i, y_i)\}$, an interpolation procedure finds $\tilde{f}(x, y)$ such that $\tilde{f}(x, y)$ is a prediction for $f$ for all $(x, y) \in \mathbb{R}^2$. Integration of data is done in such a way that it helps an interpolation procedure, such as an artificial neural network framework, to make a better predictor (interpolator). Such an approach is based on the conjecture that neural networks are capable of exploiting non-linear correlations that are hidden in the data.

Formally, compared to "classic" interpolation, where predictor variables are limited to Cartesian coordinates, in this case the set of predictors is to be expanded with other predictor variables $g_1(x, y), g_2(x, y), \ldots, g_n(x, y)$. These can be topographical features, elevation, products of aerial photography and satellite imagery, and many other surface properties.

Thus the interpolation method is replaced in this way: given

$$f\bigl(x_i, y_i, g_1(x_i, y_i), g_2(x_i, y_i), \ldots, g_n(x_i, y_i)\bigr) = f(x_i, y_i) \quad \forall (x_i, y_i) \in V,$$

build a predictor

$$\tilde{f}\bigl(x, y, g_1(x, y), g_2(x, y), \ldots, g_n(x, y)\bigr)$$

and establish an equality

$$\tilde{f}(x, y) = \tilde{f}\bigl(x, y, g_1(x, y), g_2(x, y), \ldots, g_n(x, y)\bigr).$$

While the extended form for $\tilde{f}(x, y)$ seems more complicated, it simply allows an interpolation method to fuse in possible non-linear correlations and make "better" predictions, while other formal parts of the method stay the same.

A modification of kriging called cokriging has been proposed [16]. It is oriented toward using such secondary variables as aerial and satellite photography, but there are some problems with applying it to real-world data. Kriging is oriented toward data that are normally distributed. While this is somewhat true for the lags of spatial coordinates that are used in the semivariogram and covariance function, it is not always true for the secondary predictors $g_i$. Also, it is problematic to construct a covariance function, as it may naturally be anisotropic and even non-symmetric. An artificial neural network, on the other hand, automatically adapts to nonlinearities and non-normally distributed variables.

This approach also poses some problems that are inherent to artificial neural networks. As each concrete neural network is a product of a learning procedure, some form of predictor evaluation has to be incorporated. Usually, several different learning procedures and network topologies are evaluated, and the results of the interpolation procedure are analyzed for deficiencies like overfitting. Such a superprocedure increases computational costs and may be sped up with parallel computing. Other problems are caused by data specifics, so some regularization approaches should be employed, like learning with Gaussian noise.
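The effect of extending the predictor set from $(x, y)$ to $(x, y, g_1, \ldots, g_n)$, and the need to evaluate each candidate predictor on held-out sites, can be illustrated with a small Python sketch. It deliberately uses inverse-distance weighting as a stand-in interpolator; it is neither kriging nor a neural network, and it is not the method used in the project. The grid, the single secondary variable `g`, the response values and the scaling factor 5.0 are all synthetic assumptions chosen so that the response depends on `g`.

```python
import math


def idw_predict(train, query, power=2.0):
    """Inverse-distance-weighted estimate at `query` from (features, value) pairs."""
    num = den = 0.0
    for features, value in train:
        d = math.dist(features, query)
        if d == 0.0:
            return value  # query coincides with a training point
        w = 1.0 / d ** power
        num += w * value
        den += w
    return num / den


def loo_rmse(points):
    """Leave-one-out RMSE: predict each site from all the other sites."""
    sq = 0.0
    for i, (features, value) in enumerate(points):
        rest = points[:i] + points[i + 1:]
        sq += (idw_predict(rest, features) - value) ** 2
    return math.sqrt(sq / len(points))


# Synthetic 4x4 grid: g is a checkerboard "surface property", f depends only on g.
coords_only, extended = [], []
for x in range(4):
    for y in range(4):
        g = float((x + y) % 2)
        f = 1.0 + 2.0 * g
        coords_only.append(((float(x), float(y)), f))
        # Assumed scaling of g so that it matters in the distance computation.
        extended.append(((float(x), float(y), 5.0 * g), f))

rmse_xy = loo_rmse(coords_only)
rmse_xyg = loo_rmse(extended)
print(rmse_xy, rmse_xyg)
```

On this synthetic data the extended predictor set gives the lower leave-one-out error, because sites with the same value of `g` become closer in feature space. With a neural network in place of the stand-in interpolator, the same held-out evaluation would be repeated for each learning procedure and topology to detect overfitting.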
Different types of satellite imagery are currently employed in data processing, such as LandSat [17] and MODIS (the Moderate-resolution Imaging Spectroradiometer [18]). The latter project incorporates two satellites with spectroradiometers (hence the name) that are able to take satellite imagery with a high spectral resolution of 36 spectral bands. Although its spatial resolution is moderate compared to LandSat (hence the name), it allows deeper and more thorough analyses of the Earth's surface, thus enabling interesting possibilities for research into correlation and causality (e.g. contamination spreading catalysts and accelerants).

The use of raw spectral radiation bands of spectral imagery is confronted with obvious interfering factors such as sun azimuth, time of day and season, surface altitude and slope, and others.

Thus, in addition to running the standard statistical procedures that calculate descriptive statistics and perform factor analysis, neural network data processing is being considered for use in the given project, together with various MODIS products such as surface reflectance, land surface temperature, land cover, vegetation indices, land use, etc.

7 Conclusion

The study of the migration and accumulation of highly toxic pollutants, which include heavy metals, persistent organic pollutants and radionuclides, and of the influence of pollutants on the various components of natural and urban ecosystems is a key problem of modern biogeochemistry and ecology. The aim of the given project is to create a cloud platform using modern analytical, statistical, programmatic and organizational methods to provide the scientific community with a unified system for collecting, analyzing and processing biological monitoring data.

Parts of the project have already been implemented. The rest is going to be implemented in the next two years.

References

[1] United Nations Economic Commission for Europe International Cooperative Programme on Effects of Air Pollution on Natural Vegetation and Crops (http://icpvegetation.ceh.ac.uk/)

[2] Harmens H. and Mills G. (Eds.) Air Pollution: Deposition to and impacts on vegetation in (South-)East Europe, Caucasus, Central Asia (EECCA/SEE) and South-East Asia. Report prepared by ICP Vegetation, March 2014. ICP Vegetation Programme Coordination Centre, Centre for Ecology and Hydrology, Bangor, UK. ISBN: 978-1-906698-48-5, 2014, 72 p.

[3] Harmens H., Norris D.A., Sharps K., Mills G. … Frontasyeva M., et al. Heavy metal and nitrogen concentrations in mosses are declining across Europe whilst some "hotspots" remain in 2010. Environmental Pollution. 2015, 200: p. 93-104. http://dx.doi.org/10.1016/j.envpol.2015.01.036

[4] Heavy Metals, Nitrogen and POPs in European Mosses: 2015 Survey. http://icpvegetation.ceh.ac.uk/publications/documents/MossmonitoringMANUAL-2015-17.07.14.pdf

[5] Buse A. et al. (2003). Heavy metals in European mosses: 2000/2001 survey. UNECE ICP Vegetation Coordination Centre, Centre for Ecology and Hydrology, Bangor, UK. http://icpvegetation.ceh.ac.uk

[6] J. Alijagić, 2013. Application of multivariate statistical methods and artificial neural network for separation natural background and influence of mining and metallurgy activities on distribution of chemical elements in the Stavnja valley (Bosnia and Herzegovina): PhD thesis. University of Nova Gorica.

[7] Žibret, G., Šajn, R., 2010. Hunting for Geochemical Associations of Elements: Factor Analysis and Self-Organising Maps. Mathematical Geosciences, 42(6): 681-703, doi:10.1007/s11004-010-9288-3. http://link.springer.com/article/10.1007/s11004-010-9288-3

[8] Kutovskiy N., Korenkov V., Balashov N., Baranov A., Semenov R. JINR cloud infrastructure. Procedia Computer Science, ISSN: 1877-0509, Publisher: Elsevier. 2015, 66, p. 574-583.

[9] MongoDB site and description, URL: https://www.mongodb.com/

[10] Nginx for Windows, URL: http://nginx.org/ru/docs/windows.html

[11] QGIS description, URL: http://www.qgis.org/en/site/

[12] OpenLayers description, URL: http://docs.openlayers.org/

[13] RESTful Web services: The basics, URL: https://www.ibm.com/developerworks/library/ws-restful/

[14] Goodchild M.F., Parks B.O., Steyaert L.T. 1993. Environmental modelling with GIS, Oxford University Press, New York, 488 p.

[15] S. Nickel et al. Atmospheric Environment 99 (2014) 85-93

[16] H. Wackernagel. Cokriging versus kriging in regionalized multivariate data analysis. Geoderma, 62 (1994) 83-92

[17] LandSat program description, URL: http://yceo.yale.edu/what-landsat-program

[18] MODIS spectrometer on TERRA satellite, URL: http://modis.gsfc.nasa.gov/about/