Method for Environmental Monitoring in the
Incomplete Data Conditions
Nikita Tursukova , Ilya Viksninb , Iuliia Kimc and Evgenii Neverovd
ITMO University, Saint-Petersburg, Russia


                                      Abstract
                                      In this paper, we propose a method for analyzing and processing incomplete data obtained in the en-
                                      vironmental monitoring process. Incomplete and inaccurate data often occur during the operation of
                                      environmental monitoring sensors. As a result, these data contribute to the deterioration of the envi-
                                      ronmental pollution forecast. In the developed method, data is processed, analyzed, and then a model
                                      for predicting environmental pollution is generated. This approach is effective for applying to incorrect
                                      data, as it increases the accuracy of further forecasts. In this paper, we analyze various approaches to the
                                      prediction, and implement the appropriate method implemented using neural networks mechanisms.

                                      Keywords
                                      Neural networks, environmental pollution, data forecasting


1. Introduction
With the development of industrial enterprises production capacities, the pollutants concentra-
tions detection issue increases. In order to reduce the environmental risks, enterprises invest
in early warning systems. These systems, involve predicting the values of certain substances
concentrations at potentially dangerous objects. When the number of sensors collecting in-
formation on the environmental condition increases, the issue of predicting the values when
data is incomplete arises. Due to the partial lack of information collected by the sensors, it
is impossible to accurately understand whether the local environmental situation is safe for
the ecosystem. At the same time, it is important to accurately determine the concentration of
potentially dangerous substances at critical infrastructure facilities, and not to confuse them
with other substances located within a certain area.
   In this paper, we propose a method that allows to analyze incomplete data on the environ-
mental condition, thereby increasing the accuracy of further forecasts. We start with the subject
area overview, in the next step a description of the approach is provided, than an empirical study
using real environmental monitoring data is conducted and the results obtained are described.


Proceedings of the 12th Majorov International Conference on Software Engineering and Computer Systems,
December 10–11, 2020, Online & Saint Petersburg, Russia
" stepingnik@gmail.com (N. Tursukov); wixnin@mail.ru (I. Viksnin); yulia1344@gmail.com (I. Kim);
datnever@ya.ru (E. Neverov)
 0000-0003-3848-1981 (N. Tursukov); 0000-0002-3071-6937 (I. Viksnin); 0000-0002-6951-1875 (I. Kim);
0000-0003-0733-1294 (E. Neverov)
                                    © 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
 CEUR
 Workshop
 Proceedings
               http://ceur-ws.org
               ISSN 1613-0073
                                    CEUR Workshop Proceedings (CEUR-WS.org)
2. Related Work
In industrial facilities, it is crucial to accurately determine the concentration of potentially
dangerous substances and not confuse it with others located within the same enterprise area.
In other words, it is necessary to clearly distinguish dangerous substances from harmless
ones. Methods that use machine learning to analyze environmental parameters and provide
a concentration forecast for unreliable and incomplete data were proposed. There are classic
machine learning tasks that are usually applied for critical infrastructure facilities monitoring:

    • clustering - determining how harmless a substance is, as well as to specify the release
      source location;
    • classification - determining the concentration increase possibility.

   An approach to assessing the environmental situation of various natural resources using
machine learning methods was demonstrated in [1].The article [2] predicts the level of the
territory contamination based on data obtained from several monitoring stations and transmitted
via the Internet of Things. For example, a classifier based on Bayesian networks was developed
to assess the probability of air pollution by PM2.5 particles. In [3] special attention was paid to
the air monitoring system in order to predict the appearance of pollutants based on retrospective
data. To perform this, the researchers tested three machine learning algorithms that predicted
an increase in the concentration of ground-level ozone, nitrogen dioxide, and sulfur dioxide.
   In most of the considered machine learning methods, classification is used to determine
whether the situation is critical. For instance, many projects create alarm systems that generate
a warning signal in case of detecting the state that is not regulated by the system [4]. Based on
the collected data, the model is trained, and the concentration thresholds are determined. If
such thresholds are exceeded, the alarm is activated.
   Most studies involve detecting critical situations on an object by performing a classification
task. Those methods use retrospective data for long-term forecasting, and do not consider
incomplete data that may prevent the detection of increased pollutant concentrations [5]. At
the same time, machine learning techniques are being increasingly used for detecting the a
contaminant appearance.
   The developed method involves the use of a neural network that eliminates the incomplete-
ness of the data. Further, using machine learning methods, a more accurate forecast of the
concentration of pollutants is made.


3. Materials and Methods
To solve the mentioned problem, we propose to use regression models and neural networks
that allow analyzing time series containing information on the pollutant concentration level
in the environment, and other factors that may potentially affect its content. The regression
allows to analyze the time scale and allows to obtain approximate values for the pollutants
concentration. At the same time, the use of neural networks is gaining momentum, since they
can both classify the danger of a pollutant, and generate forecasts, considering the sensors’
location and the information collected by them.
   In contrast to the back propagation neural network, which is standard for solving prediction
problems, deep neural networks is considered for predicting data when processing long time
intervals [6]. Such networks form a directed sequence between elements, which allows to
process a series of events over time, and to link previous information to the current task.
   Software data analysis implementation is performed using the Python 3.7 and R programming
languages.


4. The Environmental Monitoring Method
The environmental monitoring method represents the order of actions and operations to be
performed with the input data. Input data, in general, are parameters obtained from the
sensors that collect data the environmental condition. Data is taken for a period of time that is
determined by the operator.
   As a result of data analysis, a forecast of the pollutant values for the time period 𝑛 is obtained.
The forecast is both numerical concentration indicators and a graph that visualizes retrospective
data and data that is adjusted by the model.
   Initial data processing involves analyzing the data obtained in order to identify parameters
that affect the concentration indicators. Historical data is checked for correctness, by escaping of
abnormal data jumps, in order to more accurate further indicators prediction. A final data set is
generated, and predictors are selected-indicators that can affect the final predicted concentration
value of the predictor substance.
   In case of large variations in indicators, the collected time series data can be normalized for
more accurate analysis. Standard fields of the generated predictors and responses data set for
air analysis is described below:
    • Date Time – date and time;
    • WSW - wind speed (m/s);
    • WDW - wind direction (degrees);
    • Sigma – standard deviation of wind direction (degrees);
    • Ambient Temp – temperature (degrees Celsius);
    • Press - the atmospheric pressure (the atmosphere);
    • Amb RH - relative humidity (%);
    • NO - concentration of nitric oxide II (ppb);
    • NO2 - concentration of nitric oxide IV (ppb);
    • NOx - concentration of other nitrogen oxides (ppb);
    • SO2 - concentration of sulfur oxide IV (ppb);
    • CO-concentration of carbon monoxide (ppm);
    • O3-ozone concentration (ppb-billionth part);
    • PM10-class 10 ultrafine particle concentration (mcg/m3) ;
    • PM2.5-concentration of ultrafine particles of class 2.5 (mcg/m3).
  Regression analysis is performed by constructing linear and logistic regression models,
described by the expression (1).
                                          ∑︁
                                 𝑦=𝑔(b0 +    (bi xi ) + 𝜀) ,                         (1)
where 𝑦 is a continuous dependent variable; b0 is a free term of line assessment; bi is an angular
regression coefficient; xi - factors continuous model, 𝑔 is a sigmoid function for implementing a
logistic regression model.
  The autoregressive model, in turn, can also be supplemented with logistic regression, and is
described by (2).                             ∑︁
                                     𝑥t =b0 +    (bi xt−i ) + 𝜀t ,                             (2)
where 𝑥t is the series value at time 𝑡; b0 -free term of line assessment; bi -angular regression
coefficient; xt−i - value of time series at time 𝑡 − 1.
  To evaluate the constructed models, the following metrics were used:

    • Mean Absolute Error (MAE);
    • Mean Squared Error (MSE);
    • Root Mean Squared Error (RMSE);

These metrics calculation is represented by (3)-(5).
                                                 𝑛
                                          1 ∑︁
                                    𝑀 𝐴𝐸=      |𝑦𝑖 −𝑦ˆ𝑖 | ,                                    (3)
                                          𝑛
                                                𝑖=1

                                                 𝑛
                                         1 ∑︁
                                   𝑀 𝑆𝐸=       (𝑦𝑖 −𝑦ˆ𝑖 )2 ,                                   (4)
                                         𝑛
                                           𝑖=1
                                        ⎯
                                        ⎸ 𝑛
                                        ⎸ 1 ∑︁
                                  𝑅𝑀 𝑆𝐸=⎷        (𝑦𝑖 −𝑦ˆ𝑖 )2 ,                                 (5)
                                           𝑛
                                                     𝑖=1

where yi is the predicted value of the I-th ultrafine particle concentration indicator; ̂︀
                                                                                        yi is the
real value of the i-th ultrafine particle concentration indicator.
  As a result, a timeline is formed with the results of the regression forecast, as well as the
necessary predictors that affect the concentration. Further data analysis and prediction is
performed using the recurrent neural network Long-short term memory (LSTM). LSTM is
able to identify significant information when processing sufficiently long time intervals and
sequences [7]. This is most effective for working with incomplete data in order to restore and
include it in a further forecasting task.
  The operation of a recurrent neural network is described by (6).

                                       ℎt =fw (ℎt−1 , xt ) ,                                   (6)

where ℎt is the new state that the data processing unit outputs; fw is the processing function
with parameters 𝑤; ℎt−1 is the state obtained from the previous step; xt is the incoming data.
   As a result of constructing recurrent neural network LSTM model, a graph is generated that
displays the retrospective pollutant indicators and the predicted ones. The correctness of the
model’s operation is evaluated using evaluation metrics, such as: MAE, MSE, RMSE.
5. Empirical Study
To conduct an empirical study, we analyzed data from open sources on the environmental
condition. Data collected by the Stoke Hills station in Darwin, Australia, was selected for the
present study. According to open data of the Northern territory of Australia environmental
protection office, excess of the PM10 and PM2.5 particles number is observed in the air at this
site.
   The choice of data depended on the location of the monitoring station. The data used in the
experiment were collected near the coal transportation station. This allowed to record a large
number of concentration spikes in the test data set, as well as data losses due to sensor failures.
   Data collected by the station include meteorological: wind direction and speed, temperature,
pressure and humidity, and the concentration of particles (PM10 and PM2.5). Data is collected
every hour. For the sample, we took data for a year (∽9000 indicators). An example of a
concentration display graph is shown in Figure 1.


Figure 1: PM10 concentration.


6. Results
Figure 2 shows the retrospective data collected for PM10 in the air using a timeline. At the same
time, there is a gap in the collected data that needs to be filled with data that is close to real
values in order to make further predictions more accurate.
   To solve the prediction problem, we use a model built by the LSTM neural network. For
instance, Figure 3 graphically shows the results of incomplete data recovery, as well as its
further prediction.
   A regression analysis was performed, during which the response and predictors were trans-
formed. Some of the results of the regression analysis are shown in Table 1.
   As a result of analyzing the data set used in the experiment, it was found that the combination
Figure 2: PM10 concentration.


Figure 3: Recovered data.


of predictors describing temperature and humidity positively affects the determination adjusted
coefficient value, which was used for data processing.In addition to temperature and humidity,
the dependence of the concentration of substances on the seasons was revealed. This allows to
more effectively use the LSTM network to restore data.
   Thus, using metric estimates, the most successful sets of input data were selected, including
predictors necessary for forecasting. Further evaluation is performed after the implementation
Table 1
Results of regression analysis.
                                                                       The value of metrics
      Regression       Predictors and response transformations
                                                                 R-squared RMSE           p-val
    Autoregression        PM10(t-1),PM10(t-2),T(t-1),RH(t- 1)      0.686       8.275 <2.2e-16
    Autoregression         PM10(t-1),PM10(t- 2),PM10(t-3)           0.66        8.6    <2.2e-16
        Linear                        PM10(t-2)                    0.1572      14.2    <2.2e-16
        Linear              RH, T, direction,log(PM10+2)           0.137        0.6    <2.2e-16


of the prediction model via neural networks. The estimation is performed both by analyzing
the results metrics and using graphs, comparing retrospective and predicted data.
   If the problem of incomplete data occurs, when factors affecting the polluting parameter
cannot be considered, the predicted data should be brought closer to the actual one. To perform
this, the timeline is modeled on more retrospective information. Figures 4-5 show graphs of fit-
ting and predicting concentrations over a time series, obtained via the recurrent neural network
(LSTM) model. Network training and further prediction were performed on a concentration
data set, which was the only input parameter.


Figure 4: The fit of the model.


7. Discussion
The results obtained using the data analysis method show that the generated pollutant con-
centration forecast is close to real values. In addition, the data obtained allow to analyze
future deviations in the substances concentration over long time periods. However, incomplete
concentration data can be restored based on retrospective measurements.
Figure 5: A certain vector of concentration growth.


   Since the forecast accuracy were stable even in the incomplete data conditions, the proposed
method allows to being implemented with the systems where sensors may fail due to different
technical problems or malfunctions.


8. Conclusion
In this paper, we proposed and implemented a method for data processing and analysis that
allows to predict deviations in the pollutants content, in the unreliable and incomplete data
conditions. The method was implemented using the R and Python 3.7 programming languages,
and was tested on real data on the environmental conditions obtained from public sources.
The data was inaccurate and contained omissions in the measurements. Using the developed
method, the missing data was restored, as well as the necessary parameters were evaluated and
selected, on the basis of which the data forecast was performed. The predicted concentration
values were close to the actual data. The industrial enterprises can benefit from implementing
of such approach, where it is necessary to correctly predict the pollutants concentration in the
atmosphere, since the proposed method allows to efficiently process the data that might be
damaged or inaccurate.


Acknowledgements
This paper is supported by the Government of Russian Federation (grant 08-08).
References
[1] Pandey S. K., Kim K. H., Tang K. T. A review of sensor-based methods for monitoring
    hydrogen sulfide //TrAC Trends in Analytical Chemistry. – 2012. – pp. 87-99.
[2] Chiwewe T. M., Ditsela J. Machine learning based estimation of Ozone using spatio-temporal
    data from air quality monitoring stations //2016 IEEE 14th International Conference on
    Industrial Informatics (INDIN). – IEEE, 2016. – pp. 58-63.
[3] Shaban K. B., Kadri A., Rezk E. Urban air pollution monitoring system with forecasting
    models //IEEE Sensors Journal. – 2016. – №. 8. – pp. 2598-2606.
[4] C. Kühnerta, T. Bernarda, I. Montalvo Arango, R. Nitsche, “Water Quality Supervision of
    Distribution Networks Based on Machine Learning Algorithms and Operator Feedback” //
    Procedia Engineering, 89, 2014, pp. 189-196.
[5] Bianchi F. M. et al. Recurrent neural networks for short-term load forecasting: an overview
    and comparative analysis. – Springer, 2017.
[6] C. Plant, C. Böhm, “INCONCO: Interpretable clustering of numerical and categorical objects”
    // Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery
    and Data Mining, 2011, pp. 1127-1135.
[7] K. Frederix, M. V. Barel, “Sparse spectral clustering method based on the incomplete
    Cholesky decomposition” // Journal of Computational and Applied Mathematics, 237(1),
    2013, pp. 145-161.