Insights for Wellbeing: Predicting Personal Air Quality Index Using Regression Approach

Amel Ksibi1, Amina Salhi1, Ala Alluhaidan1, Sahar A. El_Rahman1,2
1 College of Computer and Information Sciences, Princess Nourah Bint Abdulrahman University, Riyadh, Saudi Arabia
2 Electrical Engineering Department, Faculty of Engineering-Shoubra, Benha University, Cairo, Egypt
amelksibi@pnu.edu.sa, Aisalhi@pnu.edu.sa, Asalluhaidan@pnu.edu.sa, sahr_ar@yahoo.com

ABSTRACT
Providing air pollution information to individuals enables them to understand the air quality of their living environments. Thus, the association between people’s wellbeing and the properties of the surrounding environment is an essential area of investigation. This paper proposes an air quality prediction approach that harvests public/open data and leverages them to estimate the Personal Air Quality Index. Such data are usually incomplete. To cope with the problem of missing data, we apply the KNN imputation method. To predict the Personal Air Quality Index, we apply a voting regression approach based on three base regressors: a Gradient Boosting regressor, a Random Forest regressor and a linear regressor. Evaluating the experimental results using the RMSE metric, we obtained an average score of 35.39 for Walker and 51.16 for Car.

1 INTRODUCTION
Air pollution has a severe impact on public health and the environment [1]. Providing air pollution information to individuals enables them to understand the air quality of their living environments. Thus, the association between people’s wellbeing and the properties of the surrounding environment is an essential area of investigation [2]. Public atmospheric monitoring stations in urban areas provide large quantities of global air quality data (GAQD) by deploying expensive high-end air pollution sensors across the globe. These data, which include weather data (temperature, wind) and air pollution data (PM2.5, NO2, O3) collected over the city, have been investigated widely for the general population [3]. However, at the scale of individuals and their personal wellbeing, such investigations are too limited, leading to low accuracy and low spatio-temporal resolution when assessing the impact of air pollution on personal health.

With the plenitude of sensing devices, developing hypotheses about the associations within the heterogeneous sensor data captured by these devices contributes towards building effective models that make it possible to understand the impact of the environment on wellbeing at the individual scale. Such models are necessary since not all cities are fully covered by standard air pollution and weather stations. The critical research question here is whether we can use only data from open sources (e.g., weather, air pollution data) to predict personal air pollution data. However, it is not always possible to gather plentiful amounts of such data. As a result, a key research question remains open: can sparse or incomplete data be used to gain insight into wellbeing?

Meanwhile, machine learning techniques have brought more opportunities for accurate prediction of air pollution [4]. Thus, it is necessary to find new approaches based on data analytics for the personal air quality prediction challenge.

The objective of this study was to evaluate the ability of regression approaches to predict individual air pollutant values and the air quality index (AQI).

Our paper is organized as follows. In Section 2, we present the state of the art on air quality prediction methods. In Section 3, we discuss the proposed process for air pollutant prediction. Section 4 analyses the results, while Section 5 covers discussion and conclusion.

2 RELATED WORK
City-wide air quality prediction has been of interest over the past 40 years [3]. However, all these studies focused only on determining air pollutant values at city scale for the general population. At personal scale, recent investigations focus on crowdsourced computing through harvesting data from wearable sensors [5]. These sensors provide lifelog data, which can be classified into two categories: numerical data (weather data, environmental variables, GPS, time, health measurements, etc.) and multimedia data. This study focuses on personal air quality prediction using numerical lifelog data. Personal air quality is a significant indicator when evaluating the impact of air pollution on personal health [6].

The main challenge in predicting personal air quality is developing an effective model from a small, sparse or incomplete training dataset. To deal with this issue, Zhao et al. [7] proposed a prediction model based on a CRNN (convolution recurrent neural network) for short-term PM2.5 pollution prediction that utilizes the spatial-temporal features of atmospheric sensing data. The experiments were conducted using the atmospheric sensing dataset from thirty-three coastal cities in China and Fukuoka’s environmental monitoring dataset from 2015 to 2017.

Zhao et al. [6] designed a transfer learning model with an encoder-decoder structure, using decoder transfer learning (DTL) based on the Wasserstein distance to match the heterogeneous distributions of the atmospheric monitoring station data (the source domain) and the personal air quality data (the target domain).

Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
MediaEval’20, December 14-15 2020, Online
The aforementioned methods focus on determining the personal air quality index from various factors such as weather, GPS, and environmental data. In this paper, we aim to select the most important factors that influence the prediction of personal air quality data.

3 METHODOLOGY
Our proposed process contains two steps: data preprocessing, and then training a voting regressor to perform Personal Air Quality Prediction with public/open data.

3.1 Data preprocessing
The dataset used in this paper is the Personal Air Quality Dataset (PAQD), which is described in [5]. It contains weather data (e.g., temperature, humidity), atmospheric data (e.g., O3, PM2.5, and NO2), GPS data, and multimedia data (e.g., images, annotation). Since data quality and representativeness play a crucial role in the effectiveness of a prediction algorithm, we perform data preprocessing to guarantee the quality of the data. This process consists of missing data imputation, feature extraction and feature selection.

a) Missing data imputation
Based on the hypothesis that heterogeneous data recordings made at nearby locations and times are strongly correlated, we consider two recordings to be close if the features that are missing in neither of them are close. We can therefore estimate each missing feature as the mean of its values in the k nearest recordings. We used sklearn.impute.KNNImputer to estimate the missing values, with k=5.

b) Features extraction
Based on the assumption that the level of pollution may vary from one period to another on the same day, from one day to another in the same month, and from one month to another in the same year, we extracted the following features from the datetime component to enrich the learning model with temporal information: month number [1-12], day [1-31], hour of the day [0-23], minute [0-59].

c) Features selection
To select the most important features, we evaluated different combinations of features by applying a simple regressor to the training dataset. According to the obtained results, including weather data increases the RMSE. We therefore decided to focus only on time data and GPS data to predict the values of the pollutant variables O3, PM2.5, and NO2.

3.2 Personal Air Quality Prediction with public/open data
Personal Air Quality Prediction can be represented as a regression problem in which we are required to determine a continuous value, the AQI. Given the selected features, we apply a regression approach to estimate the value of each pollutant variable. To this end, we tested different regressor models on the training dataset and obtained the best results with the voting regressor. The voting regressor is an ensemble meta-estimator that fits several base regressors, each on the whole dataset, and then averages their individual predictions to form a final prediction. In our voting regressor, we opt for a Gradient Boosting regressor, a Random Forest regressor and a linear regressor as base regressors. The Gradient Boosting regressor relies on a loss function to be optimized, a weak learner to make predictions, and an additive model that adds weak learners to minimize the loss function. This technique yields a prediction model that is usually an ensemble of decision trees. A Random Forest regressor uses multiple decision trees and bootstrap aggregation to produce a more reliable prediction model. Linear regression, the best-known form of regression analysis, is based on a linear predictor function with unknown model parameters.

4 EXPERIMENTAL RESULTS AND ANALYSIS
In this section, we report and discuss the experimental results achieved after submitting one run for Task 1, “Personal Air Quality Prediction with public/open data”. Table 1 presents the official results for our run based on the regression approach. The performance of the predictions was evaluated using the root mean square error (RMSE), so lower scores are better. As can be seen in Table 1, O3 achieved the best result, with a score of 12.08, using sensor data collected by walkers, while NO2 showed the best result, with a score of 25.02, using sensor data collected by car. Moreover, the results obtained for AQI from walker data (35.39) are better than those obtained from car data (51.16). This may indicate that the sensor data collected by walkers are of higher quality than the data collected by car.

Table 1: Official results of the submitted run (RMSE)

              PM2.5    NO2      O3       AQI
AVG walkers   35.34    25.98    12.08    35.39
AVG car       40.93    25.02    35.98    51.16

5 CONCLUSIONS
This paper presents our first attempt to address the task “Personal Air Quality Prediction with public/open data”. The proposed solution is based on data preprocessing and on training a voting regressor built from three base regressors. The obtained results suggest that the sensor data collected by walkers are of better quality than those collected by car. As future work, we will investigate transfer learning over multimedia lifelog data, such as egocentric photos and videos, to gain insights about individual wellbeing.

ACKNOWLEDGMENTS
The authors extend their appreciation to the Deputyship for Research & Innovation, Ministry of Education in Saudi Arabia for funding this research work through the project number PNU-DRI-RI-20-033.

REFERENCES
[1] H. Song, K. J. Lane, H. Kim, H. Kim, G. Byun, M. Le, Y. Choi, C. R. Park, and J. T. Lee, “Association between Urban Greenness and Depressive Symptoms: Evaluation of Greenness Using Various Indicators,” International Journal of Environmental Research and Public Health, 16(2), 173, 2019.
[2] P. Vo, T. Phan, M. Dao, and K. Zettsu, “Association Model between Visual Feature and AQI Rank Using Lifelog Data,” 2019 IEEE International Conference on Big Data (Big Data), Los Angeles, CA, USA, 2019, pp. 4197-4200.
[3] Y. Xu, W. Yang, and J. Wang, “Air quality early-warning system for cities in China,” Atmospheric Environment, vol. 148, pp. 239-257, 2017.
[4] S. Ameer, M. A. Shah, A. Khan, H. Song, C. Maple, S. U. Islam, and M. N. Asghar, “Comparative analysis of machine learning techniques for predicting air quality in smart cities,” IEEE Access, vol. 7, pp. 128325-128338, 2019.
[5] M. S. Dao, P. J. Zhao, N. T. Nguyen, T. B. Nguyen, D. T. Dang-Nguyen, and C. Gurrin, “Overview of MediaEval 2020: Insights for wellbeing task - multimodal personal health lifelog data analysis,” in MediaEval Benchmarking Initiative for Multimedia Evaluation, CEUR Workshop Proceedings, Dec 2020.
[6] P. Zhao and K. Zettsu, “Decoder Transfer Learning for Predicting Personal Exposure to Air Pollution,” 2019 IEEE International Conference on Big Data (Big Data), Los Angeles, CA, USA, 2019, pp. 5620-5629.
[7] P. Zhao and K. Zettsu, “Convolution Recurrent Neural Networks for Short-Term Prediction of Atmospheric Sensing Data,” The 4th IEEE International Conference on Smart Data (SmartData 2018), pp. 815-821.
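
The process of Section 3 (KNN imputation with k=5, temporal features taken from the timestamp, and a voting regressor averaging a Gradient Boosting regressor, a Random Forest regressor and a linear regressor, scored with RMSE) can be illustrated with a minimal sketch. It is not the code of the submitted run: the data are synthetic stand-ins for PAQD, the helper names (time_features, build_voting_regressor, rmse) are hypothetical, and all hyperparameters other than k=5 are assumptions chosen only to keep the example small.

```python
from datetime import datetime, timedelta

import numpy as np
from sklearn.ensemble import (GradientBoostingRegressor,
                              RandomForestRegressor, VotingRegressor)
from sklearn.impute import KNNImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def time_features(ts):
    """Month [1-12], day [1-31], hour [0-23] and minute [0-59] of a datetime."""
    return [ts.month, ts.day, ts.hour, ts.minute]


def build_voting_regressor(seed=0):
    """Ensemble that averages the predictions of the three base regressors."""
    return VotingRegressor([
        ("gb", GradientBoostingRegressor(random_state=seed)),
        ("rf", RandomForestRegressor(n_estimators=50, random_state=seed)),
        ("lr", LinearRegression()),
    ])


def rmse(y_true, y_pred):
    """Root mean square error between two equally sized sequences."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


# Synthetic stand-in data (assumption): timestamps, GPS points, one pollutant.
rng = np.random.default_rng(0)
n = 300
start = datetime(2020, 1, 1)
stamps = [start + timedelta(minutes=37 * i) for i in range(n)]
times = np.array([time_features(t) for t in stamps], dtype=float)
gps = rng.uniform([35.60, 139.60], [35.80, 139.90], size=(n, 2))  # lat, lon

# Made-up target: depends on hour of day and latitude, plus noise.
y = 20.0 + 2.0 * times[:, 2] + 50.0 * (gps[:, 0] - 35.60) + rng.normal(0, 1, n)

# Remove about 10% of the GPS values to mimic incomplete recordings.
gps_missing = gps.copy()
gps_missing[rng.random(gps.shape) < 0.10] = np.nan
X = np.hstack([times, gps_missing])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Each missing value becomes the mean of that feature over the 5 nearest
# recordings, with distances computed on the features present in both rows.
imputer = KNNImputer(n_neighbors=5)
X_tr = imputer.fit_transform(X_tr)
X_te = imputer.transform(X_te)

model = build_voting_regressor().fit(X_tr, y_tr)
pred = model.predict(X_te)
print("RMSE on held-out synthetic data:", rmse(y_te, pred))
```

In this sketch the imputer is fitted on the training split only and then reused on the held-out split, which is one reasonable design choice; the paper does not state how imputation and evaluation data were separated.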