<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>GreenEyes: An Air Quality Evaluating Model based on WaveNet</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kan Huang</string-name>
          <email>kan.huang@connect.ust.hk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kai Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ming Liu</string-name>
          <email>eelium@ust.hk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>AMLTS'22: Workshop on Applied Machine Learning Methods for Time Series Forecasting, co-located with the 31st ACM International Conference on Information and Knowledge Management</institution>
          ,
          <addr-line>CIKM</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Lehigh University</institution>
          ,
          <addr-line>27 Memorial Dr W, Bethlehem, PA</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>The Hong Kong University of Science and Technology</institution>
          ,
          <addr-line>Clearwater Bay, Hong Kong</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Accompanying rapid industrialization, humans are suffering from serious air pollution problems. The demand for air quality prediction is becoming more and more important to the government's policy-making and people's daily life. In this paper, we propose GreenEyes, a deep neural network model which consists of a WaveNet-based backbone block for learning representations of sequences and an LSTM with a Temporal Attention module for capturing the hidden interactions between features of multi-channel inputs. To evaluate the effectiveness of our proposed method, we carry out several experiments, including an ablation study, on our collected and preprocessed air quality data near HKUST. The experimental results show our model can effectively predict the air quality level of the next timestamp given any segment of the air quality data from the data set. We have also released our standalone dataset at this URL. The model and code for this paper are publicly available at this URL.</p>
      </abstract>
      <kwd-group>
        <kwd>deep learning</kwd>
        <kwd>neural networks</kwd>
        <kwd>fitting model</kwd>
        <kwd>regression analysis</kwd>
        <kwd>AIoT</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        With the development of the global economy and industrialization, people's living standards have improved; in the meanwhile, environmental problems such as air pollution have become a big concern. As the World Health Organization (WHO) stated [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], air pollution is the world's largest environmental health risk, which will incur many diseases including but not limited to respiratory infections, heart disease, COPD, stroke, and lung cancer.
      </p>
      <p>
        Among all kinds of pollution, air pollution has the largest impact on premature deaths annually [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Hence, as people's awareness of health increases, more and more smart devices such as smart bands have been developed and equipped, which can report the air quality status. Moreover, a smart indoor air purifier can automatically purify the air when the resident is not at home.
      </p>
      <p>
        The air pollution problem is widely discussed in the fields of Artificial Intelligence of Things (AIoT) and Sensing Networks. Some IoT systems with various functions are designed to monitor air quality for different application scenarios [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3, 4, 5</xref>
        ]. For instance, Ray et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] built a smart air-borne PM2.5 density monitoring system based on the cloud platform. However, these systems simply execute quality detection tasks without considering future air quality, which would let the purifier intelligently control its power level for energy-saving purposes. To bridge this gap, we propose the GreenEyes framework to predict the trend from previous air pollution levels. The feedback control system is illustrated in Figure 1.
      </p>
      <p>Figure 1: GreenEyes: AIoT deployment. (The diagram shows PM2.5/10 data flowing from a sensing device, e.g. an stm32, to a computing unit, e.g. an iOS mobile, into the GreenEyes model, whose predictions drive decisions at a control unit such as a fan or purifier, with sensing feedbacks closing the loop.)</p>
      <p>
        In this work, we first investigate the problem of preprocessing noisy PM2.5 sequence data and creating an appropriate supervising target sequence. We implement the GreenEyes model to predict future air quality and evaluate it on each channel of PM2.5 data. Besides, we train our model with all channels' data together. Other works either use different kinds of data [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] or use sensors of the same model but place them at different places [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The former methodology is Multi-sensor Fusion [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], which is widely used in intelligent and autonomous systems [
        <xref ref-type="bibr" rid="ref9 ref10">9, 10, 11, 12</xref>
        ]. However, our experiments prove that multiple sensors of the same model, at the same place, make the model perform better in predicting the target data.
      </p>
      <p>The main characteristics of this paper are summarized as follows:
• We treat WaveNet's residual layers as a feature block. This idea comes from basic structures such as convolution-activation-pooling in computer vision. Such a design can increase the reception field and learn better representations.
• We innovatively stack several WaveNet blocks to build the model's main body. As the basic mechanism of deep learning networks is to build models brick by brick, the same module with different parameters is usually used in the same model. We borrow this idea and make it possible to parameterize our model. The model's optimal hyperparameters, such as depth and number of filters, can also be fine-tuned easily.
• We put Attention [13] and LSTM [14] at the endpoint as output layers. Ablation experiments demonstrate their necessity, because this module can capture the hidden interactions between features of different sequences (channels).</p>
      <p>© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Datasets</title>
      <p>AQI (Air Quality Index) is widely used for measuring the current pollution status of the air. IAQIs (Individual Air Quality Indices) are calculated for pollutants such as ozone, nitrogen dioxide, sulphur dioxide, and others before the final AQI is concluded. In our work, the IAQI of PM2.5 is considered.</p>
      <p>IAQI level data calculated from the raw air quality data of the sensors cannot be used directly because of high-frequency noise. As Figure 3 presents, in some intervals on the time axis the IAQI level fluctuates very fast, because the air quality data is fluctuating exactly around a threshold line. In real AIoT applications, we do not want this fluctuation: imagine the downstream module is a fan switch driven by the model's output; we want this output to be relatively stable. In order to clean the data fluctuation while keeping the trend features, we innovatively brought out a method of manual human labeling. It creates an appropriate target label function that the model can learn. Also, based on the labeling tricks, the problem that predictions of the IAQI level fluctuate near the thresholds is much reduced.</p>
      <sec id="sec-2-1">
        <title>2.1. Data Collection</title>
        <p>We placed our 4 sensors in an office room located inside the Academic Building of HKUST. The room has no windows, so it provides a stable experimental environment for temperature and humidity. The sampling rate of each sensor is 1 Hz. We simultaneously collected around 220k data points for each sensor in a continuous period starting from 20:28 on 25th November 2019. This period is about two and a half days, or 61 hours.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. IAQI Calculation</title>
        <p>The final AQI depends on each pollutant's IAQI, which is calculated by Equation 1:
IAQI_p = (IAQI_Hi − IAQI_Lo) / (BP_Hi − BP_Lo) × (C_p − BP_Lo) + IAQI_Lo,
(1)
where C_p is the measured concentration of pollutant p, BP_Lo and BP_Hi are the breakpoints below and above C_p, and IAQI_Lo and IAQI_Hi are the index values corresponding to those breakpoints. Finally, AQI is calculated by Equation 2:
AQI = max{IAQI_1, IAQI_2, IAQI_3, ..., IAQI_n}.
(2)</p>
        <p>In this paper, we only concern ourselves with and discuss the IAQI regarding PM2.5.</p>
        <p>The above equations for IAQI and AQI are universal across air pollution standards. Different thresholds are used when mapping air pollutant data into IAQI in different standards. Table 1 lists the PM2.5 and PM10 IAQI thresholds in China's and the USA's standards respectively. In this paper, we use the USA standard.</p>
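<p>As a concrete sketch of Equations 1 and 2, the mapping from a concentration to its IAQI is a piecewise-linear interpolation over breakpoint rows. The helper below is ours, not the paper's released code, and assumes the 2012 US EPA PM2.5 breakpoint table:</p>

```python
# Sketch of Equations 1 and 2 (hypothetical helpers, not the authors' code):
# map a pollutant concentration to its IAQI via the breakpoint table,
# then take the max over pollutants to obtain the AQI.

PM25_BREAKPOINTS = [  # (BP_Lo, BP_Hi, IAQI_Lo, IAQI_Hi), concentrations in ug/m^3
    (0.0, 12.0, 0, 50),
    (12.1, 35.4, 51, 100),
    (35.5, 55.4, 101, 150),
    (55.5, 150.4, 151, 200),
    (150.5, 250.4, 201, 300),
    (250.5, 350.4, 301, 400),
    (350.5, 500.4, 401, 500),
]

def iaqi(c, table=PM25_BREAKPOINTS):
    """Equation 1: piecewise-linear interpolation inside the matching row."""
    for bp_lo, bp_hi, i_lo, i_hi in table:
        if bp_lo <= c <= bp_hi:
            return (i_hi - i_lo) / (bp_hi - bp_lo) * (c - bp_lo) + i_lo
    raise ValueError("concentration outside the breakpoint table")

def aqi(iaqis):
    """Equation 2: the AQI is the maximum over the individual indices."""
    return max(iaqis)
```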
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Data Polynomialization</title>
        <p>The task of our model is to predict the IAQI level when inputting a segment of air pollutant concentration data. However, the original IAQI level lines cannot be used directly, because in deep learning a step function is very hard to learn, especially on its rising and falling edges. We therefore approximate the IAQI level with a polygonal function f(t),
f(t) = k_i (t − t_i) + y_i,  t ∈ [t_i, t_{i+1}],
(3)
such that f(t) is linear on each interval [t_i, t_{i+1}]. Here k_i is the slope of the curve, and t_i and t_{i+1} are the start and end time points of each interval of the polygonal line. When k_i &gt; 0, the trend of the IAQI level is rising, and vice versa. The absolute value of k_i is the approximate and potential changing speed of the IAQI level. Thus, every polygonal line can be divided into several segments within the time intervals t_i to t_{i+1}, and every segment estimates the first-order approximate trend of the original IAQI level within the corresponding time interval.</p>
        <p>Polygonal functions can be used to generate approximations to known curves, planes, etc. Also, for unknown data, polygonal functions can be learned by algorithms such as decision trees to fit the data. In our prediction work, polygonal functions help us eliminate the hesitation area and build the target data.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Data Polygonalization: Human Labeling based on Decisions</title>
        <p>We first label by hand the level step-down and step-up points, and map them into rising and falling lines. This method transfers discrete decision points into a continuous target data series which has the same dimension as the time indices and the corresponding PM2.5 data, and thereby gives us the polygonal target data, as B. Rouet-Leduc et al. [15] did. Figure 3 shows our labeling results. The slope of the i-th labeled segment is
k_i = (y_{i+1} − y_i) / (t_{i+1} − t_i),
(5)
where y_{i+1} and y_i are the original IAQI levels at the end and start times t_{i+1} and t_i.</p>
        <p>Our experiments take these polygonalized IAQI level lines as the supervising data. The fitting problem can be described as: given an IAQI sequence of window size, predict the IAQI level of the next time frame after this time window.</p>
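<p>The labeling step can be sketched as follows: given hand-labeled (time, level) breakpoints, the dense polygonal target of Equations 3 and 5 is simply linear interpolation onto every timestamp (a minimal illustration; the names are ours, not from the released code):</p>

```python
# Hypothetical sketch: turn hand-labeled breakpoints (time index, IAQI level)
# into a dense piecewise-linear target series, i.e. evaluate the polygonal
# function f(t) of Equations 3 and 5 at every timestamp.
import numpy as np

def polygonal_target(break_t, break_y, n_samples):
    """Linearly interpolate labeled (t, y) breakpoints onto every timestamp."""
    t = np.arange(n_samples)
    return np.interp(t, break_t, break_y)

# Example: a level that stays at 1, rises to 3 between t=100 and t=200,
# then stays at 3.
target = polygonal_target([0, 100, 200, 299], [1, 1, 3, 3], 300)
```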
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>Recently, a series of neural networks related to the auto-regression model has been proposed and applied to such problems. DeepMind's WaveNet [16] is one of the famous and foundational works among them; [17] and [18] tackle sequence representation and generation. WaveNet models the joint probability of a sequence as a product of conditional probabilities,
p(x) = ∏_t p(x_t | x_1, x_2, ..., x_{t−1}).
(6)</p>
      <p>Auto-regression models can be used not only for data generation but also for time series prediction. In our work, every sample x_t and y_t at any time step t is conditioned on the samples at all previous timestamps, which makes it a multivariate auto-regression task. To limit the input length, we only consider the conditional probabilities between y_t and a sequence x_{t−1−window_size : t−1} of length window_size. Different from other multivariate auto-regression tasks, where sequences along the whole temporal axis are modeled, we do not use the sequence x_{t−1−window_size : t−1} to predict x_t; instead, we predict y_t only with x_{t−1−window_size : t−1}.</p>
      <p>Different from B. Rouet-Leduc's work [15], in which a random forest is used to predict seismic precursors, we use WaveNet as the main part of our GreenEyes model. Air pollution data has the same structure as audio data, so it is well suited to WaveNet and can be modeled in the same way; WaveNet's dilated causal convolutions, residual connections, and skip connections also suit air pollution data.</p>
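<p>The windowed auto-regression setup above can be sketched as follows (a minimal illustration with hypothetical names; the released code may differ):</p>

```python
# Hypothetical sketch: build (window, next-value) training pairs from one
# IAQI channel, i.e. predict y_t from x_{t-window_size : t} only.
import numpy as np

def make_windows(x, y, window_size, stride=1):
    """Return inputs of shape (n, window_size) and targets of shape (n,)."""
    xs, ys = [], []
    for start in range(0, len(x) - window_size, stride):
        end = start + window_size
        xs.append(x[start:end])   # conditioning sequence x_{t-window_size:t}
        ys.append(y[end])         # supervising target y_t
    return np.array(xs), np.array(ys)

X, t = make_windows(np.arange(100.0), np.arange(100.0), window_size=10, stride=5)
```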
      <p>We used the original WaveNet's core part as a WaveNet Block, as we believe this block-style configuration is more modular and lets us adjust parameters more easily. Each WaveNet Block, the same as in WaveNet, contains several dilated convolution layers, called WaveNet Layers, and different dilation rates are used across the layers. (A figure here sketches the output stage: an Attention module whose query Q and value V are multiplied to produce the target y.)</p>
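<p>For illustration, one WaveNet Layer (a dilated causal convolution with WaveNet's gated activation and a residual connection) can be sketched in plain numpy; this is our simplification, not the authors' implementation:</p>

```python
# Minimal numpy sketch (ours) of one WaveNet Layer: a dilated causal
# convolution with the gated activation tanh(.) * sigmoid(.) and a
# residual connection back to the input.
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """y[t] = sum_k w[k] * x[t - k*dilation], zero-padded on the left."""
    y = np.zeros_like(x)
    for k, wk in enumerate(w):
        shift = k * dilation
        if shift == 0:
            y += wk * x
        else:
            y[shift:] += wk * x[:-shift]
    return y

def wavenet_layer(x, w_filter, w_gate, dilation):
    # Gated activation unit, as in the original WaveNet.
    gate = 1.0 / (1.0 + np.exp(-causal_dilated_conv(x, w_gate, dilation)))
    z = np.tanh(causal_dilated_conv(x, w_filter, dilation)) * gate
    return x + z  # residual connection

x = np.random.default_rng(0).normal(size=64)
out = wavenet_layer(x, w_filter=[0.5, 0.3], w_gate=[0.2, 0.1], dilation=4)
```

Because the convolution is causal, an output at time t depends only on inputs at or before t, which is what makes such a block usable for forecasting.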
      <p>The design of neural networks for deep learning has always followed principles such as modularization and expandability. Well-known networks such as VGG [19] and ResNet [20] all have these features: VGG has two model types, VGG16 and VGG19, with different depths, and ResNet has ResNet-18, ResNet-34, ResNet-50, etc. The cutting-edge model, the Transformer [21], also obeys these design principles, which makes it possible to build variants of various sizes for different application scenarios. Our model is designed for parameterization, too. Following these principles, we finally set 8 WaveNet layers for the first block, 5 layers for the second, and 3 layers for the third. All blocks share the same kernel size of 3 and 16 filters. This set of hyperparameters was chosen empirically and by the computational capability of a 1080 Ti GPU; more optimal parameters might be searched for in future work.</p>
      <sec id="sec-3-5">
        <title>Temporal Attention</title>
        <p>As for the Attention layer, we set up two kinds of attention mechanisms. The first is the dot-product attention layer, a.k.a. Luong-style attention [22], as Equations 7 and 8 show; we use the input for the value vector, the key vector, and the query vector alike:
scores = Q K^T,
(7)
Attention(Q, K, V) = softmax(scores) V.
(8)
The other mechanism is made by ourselves and is called Temporal Attention.</p>
        <p>In our Attention layer, we still use Luong's multiplicative-style attention (Equation 9) to obtain the score, but we simplify it with a fully connected network (Equation 10). Moreover, we do not use the softmax function to compute the attention weights; rather, we use the function Equation 11 shows:
score(h̄_t, h̄_s) = h̄_t^T W h̄_s,
(9)
scores = W V + b,
(10)
Attention(V) = exp(tanh(scores)).
(11)</p>
        <p>The reason we replace the softmax with a tanh function followed by an exponential function is to better adapt our model to the temporal data set. Our data set has many temporal and periodic features to learn, and the tanh function is very common in sequential models; it is also a component of every WaveNet layer.</p>
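<p>A minimal sketch (ours) contrasting the two weightings: the scores come from a fully connected map of the values (Equation 10), and the Temporal Attention weights exp(tanh(scores)) of Equation 11 are strictly positive and bounded between e^−1 and e, but, unlike softmax weights, are not normalized to sum to 1:</p>

```python
# Illustrative numpy sketch (ours): softmax weighting vs. the Temporal
# Attention weighting exp(tanh(scores)) of Equation 11.
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def temporal_attention_weights(scores):
    """Equation 11: positive, bounded weights that need not sum to 1."""
    return np.exp(np.tanh(scores))

rng = np.random.default_rng(0)
V = rng.normal(size=(6, 4))          # six timesteps, four features
W, b = rng.normal(size=4), 0.1       # FC scoring, Equation 10
scores = V @ W + b                   # one score per timestep
baseline = softmax(scores)[:, None] * V
weighted = temporal_attention_weights(scores)[:, None] * V
```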
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Experimental Settings</title>
        <p>As we sampled PM2.5 measurements from 4 sensors, Sensor 0 to Sensor 3, we have a 4-channel PM2.5 IAQI data set, and each channel's data can be taken as an individual data set. The stride is set to {10, 5, 2}, respectively. Besides, we fuse the data from all channels to create a new, combined PM2.5 data set.</p>
        <p>An Adam [23] optimizer with an initial learning rate of 0.0001 is applied in the experiments; the learning rate is multiplied by 0.1 after 20 epochs, and the total number of training epochs is 100. We use mean squared error (MSE) and mean absolute error (MAE) as the evaluation metrics.</p>
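<p>The schedule above can be sketched as a simple step function (a minimal illustration; the function name is ours):</p>

```python
# Sketch of the training schedule described above: Adam with initial
# learning rate 1e-4, multiplied by 0.1 once 20 epochs have passed.
def learning_rate(epoch, initial=1e-4, drop=0.1, drop_epoch=20):
    return initial * drop if epoch >= drop_epoch else initial
```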
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Training and Validation</title>
        <sec id="sec-4-2-1">
          <title>4.2.1. Why did We Redesign the Attention Layer?</title>
          <p>At first, we utilized the dot-product attention layer provided officially by TensorFlow. Table 2 lists all the experiments' final best metrics during training.</p>
          <p>After we trained the model with Temporal Attention, we discovered that the results with the official Attention show limitations and defects. As Table 3 shows, in most experiments Temporal Attention outperforms the official Attention. When we plot the validation curves, some principles can be figured out: specifically, Figure 6 illustrates the validation MSE curves with stride = 10, and Figure 7 illustrates the validation MSE curves when applying Temporal Attention. We can conclude that when applying the official Attention, the model cannot converge consistently across different data sets; Figure 6 shows that the model fails to converge when it learns on PM2.5(0). Meanwhile, applying Temporal Attention, the model can obtain a better MSE.</p>
          <p>4.2.2. Best Metrics during Training</p>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Model Evaluation</title>
        <p>Table 3 lists the best metrics during training when applying Temporal Attention. Figure 8 shows that our model fits the labeled IAQI level lines well, except that its predictions differ a little from the ground truth on some parts of the lines, especially at the turning corners. Figure 9 illustrates the same evaluation performance, which suggests the model may not need much data to learn when the stride is set to 2. To quantify the testing results of our model with different parameters, we test it on the whole PM2.5 sequence by setting the stride to 1. Table 4 lists the statistics of our tests.</p>
      </sec>
      <sec id="sec-4-3-1">
        <title>4.4. Ablation Study</title>
        <p>In order to validate the effectiveness of the modules, we conduct an ablation study on our GreenEyes model. We remove the bidirectional LSTM module and the multi-head attention module, respectively, and get two model variants, w/o Attention and w/o LSTM. We plot the training and validation curves of the w/o LSTM model. It is easily concluded that, without the LSTM layer, the model runs into overfitting: although it still fits the training set well, it rambles on the validation set. In order to validate the Attention layer's function, we re-run the GreenEyes model with Temporal Attention on PM2.5(0) to PM2.5(3), and then cut off this Attention layer and run the model again on the same data sets. Table 5 shows the test MSE and MAE results of both configurations. It turns out that the model w/o Attention can perform better than or equivalently to the model with the Attention layer. However, by plotting the training curves again, we found that the model with the Temporal Attention layer obtains a smaller loss during training.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.5. Hyper-parameter Discussion</title>
        <p>Inspired by the SOTA idea of predicting the target sequence from a short sequence with an auto-regression model such as Autoformer [24], we attempt to decrease the model's input size, i.e., the data's window size. We set the window size to 3600 (which means one hour on the timeline) and train our model again. Figure 12 shows our results. Empirically, the model gains good performance as long as it reduces the training loss below 0.01. Hence, except for the result on PM2.5(3) when the window size is set to 3600, the model still needs optimization if we want a shorter window size. However, it is worth trying, as the number of model parameters decreases markedly as the input size is reduced; a lighter model saves computational costs and speeds up inference.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>The WaveNet model designed for audio data processing is generalizable and suitable for fitting problems. Our work successfully puts it to use for IAQI level fitting and prediction, and shows that our WaveNet-based GreenEyes model has strong data fitting capability for extremely long data sequences. When given a smaller stride and fed with more data, the model can learn better. It is also found that, when trained with more channels of sensor data, the model performs well; this can be regarded as sensor data augmentation. Our innovative method of manually labeling the IAQI level is useful: it creates an appropriate target label function that the model can learn, and it solves the threshold fluctuation problem.</p>
      <p>It is also promising that our GreenEyes AIoT deployment design can be put into practice. We have already developed an iOS app to retrieve the air quality data, and mobile frameworks such as TensorFlow Lite [25] are available. A mobile phone could thus run our GreenEyes model to monitor the IAQI data in real time and predict the air quality trend.</p>
      <p>Due to a lack of air quality data, we only performed the data fitting task. We will perform the data prediction task in the future once enough data is gathered.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Related Works</title>
      <sec id="sec-6-1">
        <title>6.1. Statistical &amp; Machine Learning Approaches</title>
        <p>Besides the ARIMA and ETS models mentioned earlier, traditional methods such as the Kalman filter [26] are also very simple and practical for time series forecasting problems. Random forests [15], XGBoost, and SVMs [27], etc., are useful machine learning methods too. Regarding the choice of method, the most suitable one is highly interrelated with the data's properties and the application scenario.</p>
        <p>In common, the essence of both traditional approaches and ML-based approaches is mining the data and extracting features. Different from other feature engineering tasks, sliding windows are widely used for processing the data; metrics such as the minimum, the maximum, the mean, and the variance of the data in the window are common features.</p>
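<p>For instance, the classic window statistics mentioned above can be computed as follows (a minimal sketch; the names are ours):</p>

```python
# Illustrative sketch: the classic sliding-window features (min, max,
# mean, variance) over each full window of a time series.
import numpy as np

def window_features(x, window):
    """Return one (min, max, mean, var) row per full window of length `window`."""
    feats = []
    for start in range(len(x) - window + 1):
        w = x[start:start + window]
        feats.append([w.min(), w.max(), w.mean(), w.var()])
    return np.array(feats)

f = window_features(np.array([1.0, 2.0, 3.0, 4.0]), window=2)
```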
      </sec>
      <sec id="sec-6-3">
        <title>6.2. Deep Learning Approaches</title>
        <p>LSTM-based deep learning methods have been developed
recently to extract temporal patterns. Lai et al. proposed
LSTNet [28] that encodes short-term local information
into low dimensional vectors using 1D convolutional
neural networks and decodes the vectors through an RNN.
Shih et al. proposed TPA-LSTM [29] which processes the
inputs by an RNN and employs a convolutional neural
network to calculate the attention score across multiple
steps.</p>
        <p>The architecture of a CNN is designed for 2D data like images. Meanwhile, a special variant of CNNs called temporal convolutional networks (TCNs) [30] has recently been proposed that makes CNNs capable of time series processing. Yan et al. [31] released their research on using TCNs for weather forecasting in 2020 and showed that a TCN outperforms an LSTM network in this application.</p>
        <p>WaveNet-related methods, including our GreenEyes model, tackle a single sequence of time series data and show good fitting and forecasting performance concerning prediction accuracy and data throughput capacity. Meanwhile, around the same time as this work was being developed, new methods and approaches regarding time series forecasting were also proposed. In recent years, graph neural networks (GNNs) have shown high capability in handling relational dependencies. Wu et al. [32] proposed a general graph neural network framework designed specifically for multivariate time series data; their method is useful for extracting relations among variables belonging to multiple sequences.</p>
        <p>As the Transformer [21] has become greatly popular in recent years, models based on Transformers have also been brought out. Lim et al. [33] from Google introduced the Temporal Fusion Transformer (TFT), a novel attention-based architecture which combines high-performance multi-horizon forecasting with interpretable insights into temporal dynamics. They created gate-based networks, the GRN and the GLU, as new approaches for better feature selection modules.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>The authors would like to thank many friends for constructive discussions and feedback. Special thanks to Prof. Yuan Yao, who voluntarily provided a GPU machine.</p>
    </sec>
    <sec id="sec-8">
      <title>References</title>
      <p>[11] L. Wang, M. Liu, M. Q.-H. Meng, R. Siegwart, Towards real-time multi-sensor information retrieval in cloud robotic system, in: 2012 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI), IEEE, 2012, pp. 21–26.</p>
      <p>[12] P. Cai, S. Wang, Y. Sun, M. Liu, Probabilistic end-to-end vehicle navigation in complex dynamic environments with multimodal sensor fusion, IEEE Robotics and Automation Letters 5 (2020) 4218–4224.</p>
      <p>[13] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, arXiv preprint arXiv:1409.0473 (2014).</p>
      <p>[14] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (1997) 1735–1780.</p>
      <p>[15] B. Rouet-Leduc, C. Hulbert, N. Lubbers, K. Barros, C. J. Humphreys, P. A. Johnson, Machine learning predicts laboratory earthquakes, Geophysical Research Letters 44 (2017) 9276–9282.</p>
      <p>[16] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, K. Kavukcuoglu, Wavenet: A generative model for raw audio, arXiv preprint arXiv:1609.03499 (2016).</p>
      <p>[17] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerrv-Ryan, et al., Natural tts synthesis by conditioning wavenet on mel spectrogram predictions, in: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2018, pp. 4779–4783.</p>
      <p>[18] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, et al., Tacotron: Towards end-to-end speech synthesis, arXiv preprint arXiv:1703.10135 (2017).</p>
      <p>[19] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556 (2014).</p>
      <p>[20] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.</p>
      <p>[21] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.</p>
      <p>[22] M.-T. Luong, H. Pham, C. D. Manning, Effective approaches to attention-based neural machine translation, arXiv preprint arXiv:1508.04025 (2015).</p>
      <p>[23] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, 2017. arXiv:1412.6980.</p>
      <p>[24] H. Wu, J. Xu, J. Wang, M. Long, Autoformer: Decomposition transformers with Auto-Correlation for long-term series forecasting, in: Advances in Neural Information Processing Systems, 2021.</p>
      <p>[25] M. S. Louis, Z. Azad, L. Delshadtehrani, S. Gupta, P. Warden, V. J. Reddi, A. Joshi, Towards deep learning using tensorflow lite on risc-v, in: Third Workshop on Computer Architecture Research with RISC-V (CARRV), volume 1, 2019, p. 6.</p>
      <p>[26] V. Gómez, A. Maravall, Estimation, prediction, and interpolation for nonstationary series with the kalman filter, Journal of the American Statistical Association 89 (1994) 611–624.</p>
      <p>[27] N. I. Sapankevych, R. Sankar, Time series prediction using support vector machines: a survey, IEEE Computational Intelligence Magazine 4 (2009) 24–38.</p>
      <p>[28] G. Lai, W.-C. Chang, Y. Yang, H. Liu, Modeling long- and short-term temporal patterns with deep neural networks, in: The 41st International ACM SIGIR Conference on Research &amp; Development in Information Retrieval, 2018, pp. 95–104.</p>
      <p>[29] S.-Y. Shih, F.-K. Sun, H.-y. Lee, Temporal pattern attention for multivariate time series forecasting, Machine Learning 108 (2019) 1421–1441.</p>
      <p>[30] C. Lea, R. Vidal, A. Reiter, G. D. Hager, Temporal convolutional networks: A unified approach to action segmentation, in: European Conference on Computer Vision, Springer, 2016, pp. 47–54.</p>
      <p>[31] J. Yan, L. Mu, L. Wang, R. Ranjan, A. Y. Zomaya, Temporal convolutional networks for the advance prediction of enso, Scientific Reports 10 (2020) 1–15.</p>
      <p>[32] Z. Wu, S. Pan, G. Long, J. Jiang, X. Chang, C. Zhang, Connecting the dots: Multivariate time series forecasting with graph neural networks, in: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining, 2020, pp. 753–763.</p>
      <p>[33] B. Lim, S. Ö. Arık, N. Loef, T. Pfister, Temporal fusion transformers for interpretable multi-horizon time series forecasting, International Journal of Forecasting (2021).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>W. H.</given-names>
            <surname>Organization</surname>
          </string-name>
          , et al.,
          <article-title>Ambient air pollution: A global assessment of exposure and burden of disease (</article-title>
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lelieveld</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fnais</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Giannadaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pozzer</surname>
          </string-name>
          ,
          <article-title>The contribution of outdoor air pollution sources to premature mortality on a global scale</article-title>
          ,
          <source>Nature</source>
          <volume>525</volume>
          (
          <year>2015</year>
          )
          <fpage>367</fpage>
          -
          <lpage>371</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jasuja</surname>
          </string-name>
          ,
          <article-title>Air quality monitoring system based on IoT using Raspberry Pi</article-title>
          , in:
          <source>2017 International Conference on Computing, Communication and Automation (ICCCA)</source>
          , IEEE,
          <year>2017</year>
          , pp.
          <fpage>1341</fpage>
          -
          <lpage>1346</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.-S.</given-names>
            <surname>Oh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-S.</given-names>
            <surname>Seo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-H.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-D.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-J.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <article-title>Indoor air quality monitoring systems in the IoT environment</article-title>
          ,
          <source>The Journal of Korean Institute of Communications and Information Sciences</source>
          <volume>40</volume>
          (
          <year>2015</year>
          )
          <fpage>886</fpage>
          -
          <lpage>891</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>K.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Xiang</surname>
          </string-name>
          ,
          <article-title>Design and implementation of LPWA-based air quality monitoring system</article-title>
          ,
          <source>IEEE Access</source>
          <volume>4</volume>
          (
          <year>2016</year>
          )
          <fpage>3238</fpage>
          -
          <lpage>3245</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Ray</surname>
          </string-name>
          ,
          <article-title>Internet of things cloud-based smart monitoring of airborne PM2.5 density level</article-title>
          ,
          <source>in: 2016 International Conference on Signal Processing, Communication, Power and Embedded System (SCOPES)</source>
          , IEEE,
          <year>2016</year>
          , pp.
          <fpage>995</fpage>
          -
          <lpage>999</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dou</surname>
          </string-name>
          ,
          <article-title>Joint air quality and weather prediction based on multi-adversarial spatiotemporal networks</article-title>
          , arXiv preprint arXiv:2012.15037 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Niu</surname>
          </string-name>
          ,
          <article-title>Multi-sensor fusion in automated driving: A survey</article-title>
          ,
          <source>IEEE Access</source>
          <volume>8</volume>
          (
          <year>2019</year>
          )
          <fpage>2847</fpage>
          -
          <lpage>2868</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R. C.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Kay</surname>
          </string-name>
          ,
          <article-title>Multisensor integration and fusion in intelligent systems</article-title>
          ,
          <source>IEEE Transactions on Systems, Man, and Cybernetics</source>
          <volume>19</volume>
          (
          <year>1989</year>
          )
          <fpage>901</fpage>
          -
          <lpage>931</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D. L.</given-names>
            <surname>Hall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Llinas</surname>
          </string-name>
          ,
          <article-title>An introduction to multisensor data fusion</article-title>
          ,
          <source>Proceedings of the IEEE</source>
          <volume>85</volume>
          (
          <year>1997</year>
          )
          <fpage>6</fpage>
          -
          <lpage>23</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>