<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Majorov International Conference on Software Engineering and Computer Systems,
December</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Method for Environmental Monitoring in the Incomplete Data Conditions</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nikita Tursukov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ilya Viksnin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Iuliia Kim</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Evgenii Neverov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ITMO University</institution>
          ,
          <addr-line>Saint-Petersburg</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <volume>1</volume>
      <fpage>0</fpage>
      <lpage>11</lpage>
      <abstract>
        <p>In this paper, we propose a method for analyzing and processing incomplete data obtained during environmental monitoring. Incomplete and inaccurate data often occur during the operation of environmental monitoring sensors, and such data degrade the quality of environmental pollution forecasts. In the developed method, the data are processed and analyzed, and a model for predicting environmental pollution is then generated. This approach is effective when applied to incorrect data, as it increases the accuracy of further forecasts. We analyze various approaches to prediction and implement the selected method using neural networks.</p>
      </abstract>
      <kwd-group>
        <kwd>Neural networks</kwd>
        <kwd>environmental pollution</kwd>
        <kwd>data forecasting</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>As the production capacities of industrial enterprises grow, detecting pollutant
concentrations becomes an increasingly pressing issue. To reduce environmental risks, enterprises invest
in early warning systems. These systems predict the concentrations of certain substances
at potentially dangerous objects. As the number of sensors collecting
information on the environmental condition increases, the problem of predicting values from
incomplete data arises. When part of the information collected by the sensors is missing, it
is impossible to determine accurately whether the local environmental situation is safe for
the ecosystem. At the same time, it is important to determine accurately the concentration of
potentially dangerous substances at critical infrastructure facilities and not to confuse them
with other substances located within the same area.</p>
      <p>In this paper, we propose a method for analyzing incomplete data on the
environmental condition, thereby increasing the accuracy of further forecasts. We start with an
overview of the subject area, then describe the approach, and finally conduct an empirical study
using real environmental monitoring data and describe the results obtained.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>In industrial facilities, it is crucial to accurately determine the concentration of potentially
dangerous substances and not to confuse them with other substances located within the same enterprise area.
In other words, it is necessary to clearly distinguish dangerous substances from harmless
ones. Methods that use machine learning to analyze environmental parameters and provide
a concentration forecast for unreliable and incomplete data have been proposed. Two classic
machine learning tasks are usually applied to monitoring critical infrastructure facilities:
• clustering - determining how harmless a substance is and locating the release
source;
• classification - determining the possibility of a concentration increase.</p>
      <p>An approach to assessing the environmental situation of various natural resources using
machine learning methods was demonstrated in [1]. In [2], the level of
territory contamination is predicted based on data obtained from several monitoring stations and transmitted
via the Internet of Things. For example, a classifier based on Bayesian networks was developed
to assess the probability of air pollution by PM2.5 particles. In [3], special attention was paid to
an air monitoring system that predicts the appearance of pollutants based on retrospective
data. To this end, the researchers tested three machine learning algorithms that predicted
an increase in the concentration of ground-level ozone, nitrogen dioxide, and sulfur dioxide.</p>
      <p>In most of the considered machine learning methods, classification is used to determine
whether the situation is critical. For instance, many projects create alarm systems that generate
a warning signal when a state not regulated by the system is detected [4]. The model is trained on
the collected data, and the concentration thresholds are determined. If
these thresholds are exceeded, the alarm is activated.</p>
      <p>Most studies detect critical situations at an object by performing a classification
task. These methods use retrospective data for long-term forecasting and do not consider
incomplete data, which may prevent the detection of increased pollutant concentrations [5]. At
the same time, machine learning techniques are increasingly used to detect the appearance of
a contaminant.</p>
      <p>The developed method involves the use of a neural network that eliminates the
incompleteness of the data. Further, using machine learning methods, a more accurate forecast of the
concentration of pollutants is made.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Materials and Methods</title>
      <p>To solve this problem, we propose using regression models and neural networks
that analyze time series containing information on the pollutant concentration level
in the environment and on other factors that may affect it. Regression
makes it possible to analyze the time scale and to obtain approximate values of the pollutant
concentration. At the same time, the use of neural networks is gaining momentum, since they
can both classify the danger of a pollutant and generate forecasts that take into account the sensors’
locations and the information they collect.</p>
      <p>In contrast to the backpropagation neural network, which is standard for prediction
problems, deep neural networks are considered for predicting data over long time
intervals [6]. Such networks form a directed sequence between elements, which makes it possible to
process a series of events over time and to link previous information to the current task.</p>
      <p>The data analysis software is implemented in the Python 3.7 and R programming
languages.</p>
    </sec>
    <sec id="sec-4">
      <title>4. The Environmental Monitoring Method</title>
      <p>The environmental monitoring method defines the order of actions and operations to be
performed on the input data. In general, the input data are parameters obtained from the
sensors that collect data on the environmental condition. Data are taken for a period of time that is
determined by the operator.</p>
      <p>As a result of data analysis, a forecast of the pollutant values for the given time period is obtained.
The forecast consists of both numerical concentration indicators and a graph that visualizes the retrospective
data and the data adjusted by the model.</p>
      <p>Initial data processing involves analyzing the obtained data in order to identify the parameters
that affect the concentration indicators. Historical data are checked for correctness by excluding
abnormal data jumps, which makes the subsequent prediction of the indicators more accurate. A final data set is
generated, and predictors are selected, that is, indicators that can affect the final predicted concentration
value of the predicted substance.</p>
      <p>In case of large variations in the indicators, the collected time series data can be normalized for
more accurate analysis. The standard fields of the generated data set of predictors and responses for
air analysis are described below:
• Date Time - date and time;
• WSW - wind speed (m/s);
• WDW - wind direction (degrees);
• Sigma - standard deviation of wind direction (degrees);
• Ambient Temp - temperature (degrees Celsius);
• Press - atmospheric pressure (atm);
• Amb RH - relative humidity (%);
• NO - concentration of nitrogen(II) oxide (ppb);
• NO2 - concentration of nitrogen(IV) oxide (ppb);
• NOx - concentration of other nitrogen oxides (ppb);
• SO2 - concentration of sulfur(IV) oxide (ppb);
• CO - concentration of carbon monoxide (ppm);
• O3 - ozone concentration (ppb, parts per billion);
• PM10 - concentration of ultrafine particles of class 10 (µg/m3);
• PM2.5 - concentration of ultrafine particles of class 2.5 (µg/m3).</p>
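      <p>The preprocessing steps above (excluding abnormal jumps and normalizing a series) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the median-based spike rule, the threshold k, and the function names clip_spikes and min_max_normalize are introduced here for illustration only, and the sample PM10 readings are invented.</p>
      <preformat><![CDATA[
```python
from statistics import median


def clip_spikes(values, k=3.0):
    """Mark readings farther than k * MAD from the median as missing (None).

    The MAD rule and the default k are assumptions; the paper does not
    specify how abnormal jumps are detected.
    """
    present = [v for v in values if v is not None]
    med = median(present)
    mad = median(abs(v - med) for v in present)
    if mad == 0:
        return list(values)
    return [v if v is None or abs(v - med) <= k * mad else None
            for v in values]


def min_max_normalize(values):
    """Scale the available readings to [0, 1]; missing readings stay None."""
    present = [v for v in values if v is not None]
    lo, hi = min(present), max(present)
    if hi == lo:
        return [None if v is None else 0.0 for v in values]
    return [None if v is None else (v - lo) / (hi - lo) for v in values]


# Invented hourly PM10 readings with one spike and one sensor gap.
pm10 = [12.0, 14.0, 13.0, 250.0, 15.0, None, 16.0]
cleaned = clip_spikes(pm10)
scaled = min_max_normalize(cleaned)
print(cleaned)
print(scaled)
```
]]></preformat>
      <p>Marking a spike as missing rather than deleting it keeps the hourly time grid intact, so that a later restoration step can fill the gap.</p>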
      <p>Regression analysis is performed by constructing linear and logistic regression models,
described by expression (1).</p>
      <p>
        <disp-formula><label>(1)</label><tex-math><![CDATA[y = \sigma\left(b_0 + \sum_{i} b_i x_i + \varepsilon\right),]]></tex-math></disp-formula>
where y is a continuous dependent variable; b0 is the free term (intercept) of the regression line; bi is the angular
(slope) regression coefficient; xi are the continuous factors of the model; ε is the error term; σ is a sigmoid function
used to implement the logistic regression model.
      </p>
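      <p>As a hedged illustration of the regression expression, the sketch below evaluates the linear predictor and, optionally, passes it through a sigmoid. The coefficient values and the function names sigmoid and predict are hypothetical, and the error term is omitted because it is not known at prediction time.</p>
      <preformat><![CDATA[
```python
import math


def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(x, b0, b, logistic=False):
    """Evaluate b0 + sum_i b[i] * x[i]; apply the sigmoid if requested."""
    z = b0 + sum(bi * xi for bi, xi in zip(b, x))
    return sigmoid(z) if logistic else z


# Hypothetical coefficients and two predictor values.
b0, b = 0.5, [0.2, -0.1]
x = [10.0, 5.0]

linear = predict(x, b0, b)               # plain linear regression output
prob = predict(x, b0, b, logistic=True)  # logistic regression output
print(linear, prob)
```
]]></preformat>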
      <p>The autoregressive model, in turn, can also be supplemented with logistic regression, and is
described by (2).</p>
      <p>
        <disp-formula><label>(2)</label><tex-math><![CDATA[x_t = b_0 + \sum_{i} b_i x_{t-i} + \varepsilon_t,]]></tex-math></disp-formula>
where xt is the series value at time t; b0 is the free term (intercept) of the regression line; bi is the angular
(slope) regression coefficient; xt−i is the value of the time series at time t − i; εt is the error term at time t.</p>
      <p>To evaluate the constructed models, the following metrics were used:
• Mean Absolute Error (MAE);
• Mean Squared Error (MSE);
• Root Mean Squared Error (RMSE).
These metrics are calculated by (3)-(5).</p>
      <p>
        <disp-formula><label>(3)</label><tex-math><![CDATA[MAE = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|,]]></tex-math></disp-formula>
        <disp-formula><label>(4)</label><tex-math><![CDATA[MSE = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2,]]></tex-math></disp-formula>
        <disp-formula><label>(5)</label><tex-math><![CDATA[RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2},]]></tex-math></disp-formula>
where yi is the predicted value of the i-th ultrafine particle concentration indicator; ŷi is the
real value of the i-th ultrafine particle concentration indicator; n is the number of indicators.</p>
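      <p>The autoregressive forecast and the three metrics can be sketched together as follows. The coefficients, the short series, and the helper names ar_predict, mae, mse, and rmse are assumptions made for illustration; in practice the coefficients would be estimated from the data.</p>
      <preformat><![CDATA[
```python
import math


def ar_predict(history, b0, b):
    """One-step forecast b0 + sum_i b[i-1] * x_{t-i} from past values."""
    lags = history[-len(b):][::-1]  # x_{t-1}, x_{t-2}, ...
    return b0 + sum(bi * xi for bi, xi in zip(b, lags))


def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    return math.sqrt(mse(actual, predicted))


# Invented concentration series and hypothetical AR(2) coefficients.
series = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0]
b0, b = 1.0, [0.5, 0.4]

# Forecast every point that has two preceding values.
predicted = [ar_predict(series[:t], b0, b) for t in range(2, len(series))]
actual = series[2:]

print(mae(actual, predicted), mse(actual, predicted), rmse(actual, predicted))
```
]]></preformat>
      <p>All three metrics are symmetric in their two arguments, so the order in which real and predicted values are passed does not change the result.</p>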
      <p>As a result, a timeline is formed with the results of the regression forecast, as well as the
necessary predictors that affect the concentration. Further data analysis and prediction are
performed using the Long Short-Term Memory (LSTM) recurrent neural network. LSTM is
able to identify significant information when processing sufficiently long time intervals and
sequences [7]. This is most effective when working with incomplete data, which can be restored and
included in the subsequent forecasting task.</p>
      <p>The operation of a recurrent neural network is described by (6).</p>
      <p>
        <disp-formula><label>(6)</label><tex-math><![CDATA[h_t = f_w\left(h_{t-1}, x_t\right),]]></tex-math></disp-formula>
where ht is the new state that the data processing unit outputs; fw is the processing function
with parameters w; ht−1 is the state obtained from the previous step; xt is the incoming data.
      </p>
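      <p>The recurrence can be illustrated with a deliberately simplified cell. The sketch below uses a scalar state and a tanh update; it is a plain recurrent step, not an LSTM (which adds gates and a separate cell state), and the weights w_h, w_x, and bias are hypothetical values rather than trained parameters.</p>
      <preformat><![CDATA[
```python
import math


def rnn_step(h_prev, x_t, w_h, w_x, bias):
    """Compute the new state from the previous state and the new input."""
    return math.tanh(w_h * h_prev + w_x * x_t + bias)


def run_rnn(xs, h0=0.0, w_h=0.5, w_x=1.0, bias=0.0):
    """Apply the same step function to every element of the sequence."""
    states = []
    h = h0
    for x_t in xs:
        h = rnn_step(h, x_t, w_h, w_x, bias)
        states.append(h)
    return states


# Invented normalized inputs; each state depends on all earlier inputs.
states = run_rnn([0.1, 0.2, 0.3])
print(states)
```
]]></preformat>
      <p>Because the same function and parameters are reused at every step, the state carries information from earlier readings forward to the current one.</p>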
      <p>Once the LSTM recurrent neural network model is constructed, a graph is generated that
displays the retrospective and the predicted pollutant indicators. The correctness of the
model’s operation is evaluated using the MAE, MSE, and RMSE metrics.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Empirical Study</title>
      <p>To conduct an empirical study, we analyzed open-source data on the environmental
condition. Data collected by the Stoke Hills station in Darwin, Australia, were selected for the
present study. According to open data of the environmental protection office of the Northern
Territory of Australia, excessive numbers of PM10 and PM2.5 particles are observed in the air at this
site.</p>
      <p>The choice of data depended on the location of the monitoring station. The data used in the
experiment were collected near a coal transportation station. This made it possible to record a large
number of concentration spikes in the test data set, as well as data losses due to sensor failures.</p>
      <p>The data collected by the station include meteorological parameters (wind direction and speed, temperature,
pressure, and humidity) and the concentrations of particles (PM10 and PM2.5). Data are collected
every hour. For the sample, we took data for one year (approximately 9000 indicators). An example of a
concentration graph is shown in Figure 1.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Results</title>
      <p>of predictors describing temperature and humidity positively affects the value of the adjusted
coefficient of determination, which was used for data processing. In addition to temperature and humidity,
the dependence of the concentration of substances on the seasons was revealed. This makes it possible to
use the LSTM network more effectively to restore data.</p>
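      <p>For reference, the adjusted coefficient of determination can be computed as in the sketch below, which uses the standard textbook definition. The sample values and the function names r_squared and adjusted_r_squared are assumptions for illustration; the paper does not report the exact computation it used.</p>
      <preformat><![CDATA[
```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n, p):
    """Penalize R^2 for the number of predictors p given n observations."""
    if n - p - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# Invented observations and model outputs with two predictors.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.2, 3.8, 5.0]

r2 = r_squared(observed, predicted)
adj = adjusted_r_squared(r2, n=len(observed), p=2)
print(r2, adj)
```
]]></preformat>
      <p>Unlike the plain coefficient, the adjusted value rises only when an added predictor improves the fit by more than would be expected by chance, which is why it is suitable for comparing predictor sets of different sizes.</p>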
      <p>Thus, using metric estimates, the most successful sets of input data were selected, including
the predictors necessary for forecasting. Further evaluation is performed after the prediction
model is implemented with neural networks. The evaluation is performed both by analyzing
the resulting metrics and by using graphs that compare retrospective and predicted data.</p>
      <p>If the problem of incomplete data occurs and the factors affecting the polluting parameter
cannot be considered, the predicted data should be brought closer to the actual data. To achieve
this, the timeline is modeled on more retrospective information. Figures 4-5 show graphs of
fitting and predicting concentrations over a time series, obtained with the recurrent neural network
(LSTM) model. Network training and further prediction were performed on a concentration
data set, which was the only input parameter.</p>
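      <p>Restoring gaps from a single concentration series can be sketched as follows. The window width, the helper names make_windows and fill_gaps, and the sample readings are assumptions; in particular, window_mean is only a stand-in for a trained forecaster such as the LSTM model and is not the method used in the study.</p>
      <preformat><![CDATA[
```python
def make_windows(series, width):
    """Build (inputs, target) training pairs from complete stretches only."""
    pairs = []
    for end in range(width, len(series)):
        window = series[end - width:end]
        target = series[end]
        if target is not None and all(v is not None for v in window):
            pairs.append((window, target))
    return pairs


def fill_gaps(series, width, predict):
    """Replace missing values using `predict` on the preceding window.

    Gaps whose preceding window is itself incomplete are left as None.
    """
    restored = list(series)
    for t in range(width, len(restored)):
        window = restored[t - width:t]
        if restored[t] is None and all(v is not None for v in window):
            restored[t] = predict(window)
    return restored


def window_mean(window):
    """Stand-in forecaster; a trained model would be used instead."""
    return sum(window) / len(window)


# Invented hourly PM2.5 readings with two sensor gaps.
pm25 = [8.0, 9.0, 10.0, None, 12.0, 13.0, 14.0, None, 16.0]
pairs = make_windows(pm25, width=2)
restored = fill_gaps(pm25, width=2, predict=window_mean)
print(pairs)
print(restored)
```
]]></preformat>
      <p>Training pairs are taken only from stretches without gaps, while restoration proceeds left to right so that an already restored value can serve as input for a later gap.</p>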
    </sec>
    <sec id="sec-7">
      <title>7. Discussion</title>
      <p>The results obtained with the data analysis method show that the generated pollutant
concentration forecast is close to the real values. In addition, the data obtained make it possible to analyze
future deviations in substance concentrations over long time periods. Moreover, incomplete
concentration data can be restored based on retrospective measurements.</p>
      <p>Since the forecast accuracy was stable even under incomplete data conditions, the proposed
method can be implemented in systems where sensors may fail due to various
technical problems or malfunctions.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Conclusion</title>
      <p>In this paper, we proposed and implemented a method for data processing and analysis that
makes it possible to predict deviations in pollutant content under unreliable and incomplete data
conditions. The method was implemented using the R and Python 3.7 programming languages
and was tested on real data on the environmental condition obtained from public sources.
The data were inaccurate and contained omissions in the measurements. Using the developed
method, the missing data were restored, and the necessary parameters were evaluated and
selected, on the basis of which the data forecast was performed. The predicted concentration
values were close to the actual data. Industrial enterprises that need to correctly predict
pollutant concentrations in the atmosphere can benefit from implementing
such an approach, since the proposed method efficiently processes data that might be
damaged or inaccurate.</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgements</title>
      <p>This paper is supported by the Government of the Russian Federation (grant 08-08).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Pandey</surname>
            <given-names>S. K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            <given-names>K. H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tang</surname>
            <given-names>K. T.</given-names>
          </string-name>
          <article-title>A review of sensor-based methods for monitoring hydrogen sulfide</article-title>
          //
          <source>TrAC Trends in Analytical Chemistry</source>
          . -
          <year>2012</year>
          . - pp.
          <fpage>87</fpage>
          -
          <lpage>99</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Chiwewe</surname>
            <given-names>T. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ditsela</surname>
            <given-names>J.</given-names>
          </string-name>
          <article-title>Machine learning based estimation of Ozone using spatio-temporal data from air quality monitoring stations</article-title>
          //
          <source>2016 IEEE 14th International Conference on Industrial Informatics (INDIN)</source>
          . - IEEE
          ,
          <year>2016</year>
          . - pp.
          <fpage>58</fpage>
          -
          <lpage>63</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Shaban</surname>
            <given-names>K. B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kadri</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rezk</surname>
            <given-names>E.</given-names>
          </string-name>
          <article-title>Urban air pollution monitoring system with forecasting models</article-title>
          //
          <source>IEEE Sensors Journal</source>
          . -
          <year>2016</year>
          . - No. 8. - pp.
          <fpage>2598</fpage>
          -
          <lpage>2606</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Kühnert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Bernard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Montalvo Arango</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nitsche</surname>
          </string-name>
          , “
          <article-title>Water Quality Supervision of Distribution Networks Based on Machine Learning Algorithms and Operator Feedback</article-title>
          ” //
          <source>Procedia Engineering</source>
          ,
          <volume>89</volume>
          ,
          <year>2014</year>
          , pp.
          <fpage>189</fpage>
          -
          <lpage>196</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Bianchi</surname>
            <given-names>F. M.</given-names>
          </string-name>
          et al.
          <article-title>Recurrent neural networks for short-term load forecasting: an overview and comparative analysis</article-title>
          . - Springer,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Plant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Böhm</surname>
          </string-name>
          , “
          <article-title>INCONCO: Interpretable clustering of numerical and categorical objects</article-title>
          ” //
          <source>Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          ,
          <year>2011</year>
          , pp.
          <fpage>1127</fpage>
          -
          <lpage>1135</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>K.</given-names>
            <surname>Frederix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. V.</given-names>
            <surname>Barel</surname>
          </string-name>
          , “
          <article-title>Sparse spectral clustering method based on the incomplete Cholesky decomposition</article-title>
          ” //
          <source>Journal of Computational and Applied Mathematics</source>
          ,
          <volume>237</volume>
          (
          <issue>1</issue>
          ),
          <year>2013</year>
          , pp.
          <fpage>145</fpage>
          -
          <lpage>161</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>