<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Comparative Study of LightGBM on Air Quality Data Across Multiple Locations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Martina Casari</string-name>
          <email>martina.casari@unimore.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Arigliano</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Laura Po</string-name>
          <email>laura.po@unimore.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Engineering "Enzo Ferrari", University of Modena and Reggio Emilia</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Particulate Matter</institution>
          ,
          <addr-line>Low-cost sensors, Different Locations, LightGBM, Open dataset</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <fpage>29</fpage>
      <lpage>30</lpage>
      <abstract>
        <p>In this paper, we present a novel approach utilizing LightGBM algorithms to estimate PM2.5 concentrations in two distinct geographical locations, Turin in Italy and Southampton in the UK. Our methodology integrates data from low-cost sensors co-located with reference stations in both locations, ensuring data reliability. Through a rigorous analysis encompassing diverse splitting techniques, learning pipeline components, and feature selection methods, our approach showcases remarkable performance across various scenarios, promising practical applicability. We initially train and test our model on the Turin dataset, followed by an assessment of its performance within the specific geographical context. Furthermore, we extend our investigation to the Southampton dataset without any adjustments, revealing disparities in performance. Additionally, we conduct comparative training on both datasets, offering insights into contextual factors influencing model efficacy within specific geographical areas. Our findings underscore the importance of contextual considerations for accurate air quality estimation and highlight the potential of our approach for real-world deployment. The datasets used in this study are publicly available, facilitating further research and validation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Airborne particulate matter (PM) refers to tiny particles in the air that can be composed of various materials; these particles vary in size and can have different chemical compositions, originating from both natural and human-made sources [1]. Airborne PM consists of a heterogeneous mixture of solid and liquid particles suspended in air that varies continuously in size and chemical composition in space and time. PM is categorized based on the diameter of the particles, measured in micrometres (μm) [2]. The main classifications include PM1, PM2.5, PM4, and PM10, representing different size fractions, each of them causing different problems regarding both environmental conditions, affecting ecosystems [3, 4], and human health [5], with complications that mainly impact the respiratory and cardiovascular systems. PM also has environmental consequences: when it settles on the soil, it can have a detrimental impact on the nutrient cycling of plants and disrupt the ecosystem's balance. This can potentially lead to negative consequences on the entire food chain and have long-lasting effects on the environment. When it comes to health concerns, much attention has been devoted to the finer fractions, such as PM2.5.
This work was presented at Ital-IA 2024, the 4th National Conference on Artificial Intelligence.</p>
      <p>The remainder of this paper is organized as follows: Section 2 introduces the datasets; Section 3 presents the methodology, describing the models used and the pipeline implemented; Section 4 presents the results and discussion; and Section 5 provides the conclusions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Datasets</title>
      <p>The datasets considered in this study were created from measurements captured in two different geographical areas, in both cases using SPS30 low-cost (LC) sensors as input and the co-located legal stations as reference:
• Turin (Italy): LC sensors capturing records with 15-minute frequency; reference station (RS) with hourly frequency based on Arpa weather stations [6];
• Southampton (UK): LC sensors capturing records with 2-minute frequency; RS sensors with hourly frequency based on Fidas200s weather stations [7].</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>The research consisted of a methodical process with distinct stages. Firstly, a brute-force testing procedure was carried out to determine the most appropriate machine-learning model from a variety of options. Subsequently, the pipeline was created by examining the ideal dataset split, feature selection, and transformation techniques required for the specific task. Lastly, a thorough evaluation of performance metrics was conducted using the Turin dataset, including MAE, MSE, MdAE, and R2.</p>
      <p>The data was obtained through individual sensor measurements, which were then used to construct the raw datasets for both Turin and Southampton. Subsequently, a thorough analysis of the LC and RS data was conducted to create a dataset linking each reference record with a low-cost measurement. To achieve this, the input datasets were resampled to match the hourly frequency of the reference datasets. Initially, the resampling technique employed was averaging all the LC data over the RS hourly record. However, due to significant variations in the data within an hour, it was decided to assign the closest available LC record to each RS record instead. After this process, the raw datasets for both Turin and Southampton were created, and preprocessing techniques [8] were applied to uniformly adjust the data, preparing them for the training step. In the performance evaluation, only the preprocessed dataset was considered for comparison.</p>
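      <p>The closest-record assignment described above can be sketched in a few lines of Python; the function and variable names are illustrative and not taken from the paper's codebase:</p>

```python
from datetime import datetime, timedelta

def align_nearest(lc_records, rs_times):
    """For each reference-station timestamp, pick the low-cost (LC)
    record whose timestamp is closest, instead of averaging over the hour."""
    aligned = {}
    for rs_t in rs_times:
        # min over LC records by absolute time distance to the RS timestamp
        nearest = min(lc_records, key=lambda rec: abs(rec[0] - rs_t))
        aligned[rs_t] = nearest[1]
    return aligned

# Toy example: LC data every 15 minutes, RS timestamps hourly
t0 = datetime(2023, 1, 1, 0, 0)
lc = [(t0 + timedelta(minutes=15 * i), float(i)) for i in range(8)]  # 00:00..01:45
rs = [t0 + timedelta(hours=h) for h in range(2)]                     # 00:00, 01:00
print(align_nearest(lc, rs))
```

      <p>A real implementation over large datasets would typically use a vectorized nearest-key join (e.g. pandas' merge_asof) rather than a linear scan per reference record.</p>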
      <p>Incorporating contextual features based on time into
the feature extraction process has allowed for a more
thorough understanding of the data. This approach not
only captures the original features but also encodes
information about the time axis, enabling a fine and
accurate representation of patterns that unfold over time.</p>
      <p>Ultimately, this results in more insightful and precise
outcomes.</p>
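      <p>As a sketch, the contextual time features mentioned here (and listed later among the dataset features: month, day of the week, hour) can be derived from each record's timestamp; the helper name is illustrative:</p>

```python
from datetime import datetime

def time_features(ts):
    """Encode contextual time features alongside the sensor features:
    month (1-12), day of the week (0 = Monday), and hour (0-23)."""
    return {"month": ts.month, "day_of_week": ts.weekday(), "hour": ts.hour}

# July 14, 2023 is a Friday, so day_of_week is 4
print(time_features(datetime(2023, 7, 14, 16, 30)))
```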
      <p>3.1. Model
The first step was to determine the appropriate model for the problem at hand. To accomplish this, a Bulk Regressor was implemented. This function tests a variety of regression models from popular Python libraries, such as scikit-learn, on the target dataset, ultimately producing a ranking of the most successful models based on average prediction accuracy metrics. Interestingly, the top-performing models were nonlinear, indicating that interpreting the features required an examination of nonlinear relationships between them. As a result, LightGBM was chosen as the model for this study.</p>
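      <p>The Bulk Regressor idea — fit many candidate regressors on the same data and rank them by an average error metric — can be illustrated with a dependency-free sketch; the two toy models and all names below are ours (the actual implementation drew its candidates from libraries such as scikit-learn):</p>

```python
class MeanRegressor:
    """Baseline: always predicts the training mean (a linear, constant model)."""
    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
    def predict(self, X):
        return [self.mean_ for _ in X]

class NearestRegressor:
    """1-nearest-neighbour on the first feature (a simple nonlinear model)."""
    def fit(self, X, y):
        self.data_ = list(zip(X, y))
    def predict(self, X):
        return [min(self.data_, key=lambda d: abs(d[0][0] - x[0]))[1] for x in X]

def bulk_rank(models, X_train, y_train, X_test, y_test):
    """Fit every candidate model and rank them by mean absolute error (ascending)."""
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        scores[name] = sum(abs(p - t) for p, t in zip(pred, y_test)) / len(y_test)
    return sorted(scores.items(), key=lambda kv: kv[1])

# Toy nonlinear target: y = x^2
X = [[float(i)] for i in range(10)]
y = [x[0] ** 2 for x in X]
ranking = bulk_rank({"mean": MeanRegressor(), "1nn": NearestRegressor()},
                    X[:8], y[:8], X[8:], y[8:])
print(ranking)
```

      <p>On this toy quadratic target, the nonlinear 1-NN baseline outranks the constant mean predictor, mirroring the observation that nonlinear models led the ranking.</p>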
      <p>LightGBM (Light Gradient Boosting Machine) is a powerful and efficient gradient-boosting framework developed by Microsoft researchers in 2017 [9]. It is designed to be efficient and scalable, making it particularly well-suited for large datasets and high-dimensional feature spaces.</p>
      <p>It utilizes the boosting framework, building an ensemble of weak learners (decision trees) sequentially to minimize the overall prediction error, thus ultimately combining multiple weak models to create a strong predictive model. Unlike depth-first tree growth in traditional gradient boosting frameworks like XGBoost [10], LightGBM adopts a leaf-wise tree growth strategy which chooses the leaf with the maximum delta loss to grow, which can lead to faster convergence and reduced computational cost. The trees are then used as usual, choosing the path that maximizes the information gain, which is evaluated via the variance score of each node. Other characteristics are that it includes a feature selection process by itself, and the loss usually used is the Mean Squared Error (MSE) loss, Eq. 1.</p>
      <p>The final set of features included in the datasets comprises "pm1", "pm2p5", "pm2p5 RF target", "pm4", "pm10", "wind speed", "pressure", "temperature", "relative humidity", "month", "day of the week", and "hour". The correlation matrix is depicted in Figure 1.</p>
      <p>$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ (1)</p>
      <p>3.2. Split Techniques
Different split configurations were tested in order to obtain the optimal one for this case study, starting from a simple random split and going towards more complex splits based on the time period considered. The different splits considered are:
• Random Total Split (RTS): Random split among all the records in the domain of the whole dataset;
• Random Day Split (RDS): Random split obtained by grouping all the records by day, then randomly splitting in the subdomain of the single day;
• Random Month Split (RMS): Random split obtained by grouping all the records by month, then randomly splitting in the subdomain of the single month;
• Forecast Day Split (FDS): Forecast split obtained by grouping all the records by day, then assigning the first 75% to the train and the last 25% to the test in the subdomain of the single day;
• Forecast Month Split (FMS): Forecast split obtained by grouping all the records by month, then assigning the first 75% to the train and the last 25% to the test in the subdomain of the single month.</p>
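      <p>The grouped splits above can be sketched as follows; `random_day_split` implements the RDS idea (group records by calendar day, then split 75/25 at random within each day). Names are illustrative, not from the paper's codebase:</p>

```python
import random
from datetime import datetime, timedelta
from collections import defaultdict

def random_day_split(records, train_ratio=0.75, seed=0):
    """RDS: group records by calendar day, then randomly assign
    75% of each day's records to the training set and 25% to the test set."""
    by_day = defaultdict(list)
    for rec in records:
        by_day[rec["time"].date()].append(rec)
    rng = random.Random(seed)
    train, test = [], []
    for day_records in by_day.values():
        rng.shuffle(day_records)
        cut = int(len(day_records) * train_ratio)
        train.extend(day_records[:cut])
        test.extend(day_records[cut:])
    return train, test

# Toy example: 2 days x 24 hourly records
records = [{"time": datetime(2023, 1, 1) + timedelta(hours=h), "pm2p5": float(h)}
           for h in range(48)]
train, test = random_day_split(records)
print(len(train), len(test))  # 36 12
```

      <p>The forecast variants (FDS/FMS) would replace the shuffle with a chronological sort, assigning the first 75% of each group to the training set.</p>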
      <p>Every split considered kept a 75-25 ratio between the
training and test set, simply varying the domain
considered and whether the records were picked randomly or
sequentially. Each of the aforementioned split techniques
was tested over the preprocessed Turin dataset to choose
the best-performing split for the next steps.</p>
      <p>As can be inferred from the results in Table 1, RTS achieves the best results across the board; however, since we are working with time series, this split should not be chosen, as it tends to overestimate performance due to the nature of the data. Therefore, the split technique adopted in the next steps of this research is the RDS.</p>
      <p>Table 1: Dataset split with performance metrics over the preprocessed Turin dataset.</p>
      <p>3.3. Pipeline
3.3.1. Feature Selection
The feature selection step measures the correlation between each feature and the target variable, using the correlation coefficient r to assess these correlations, as indicated by Equation 2.</p>
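      <p>Equation 2 itself did not survive extraction, but the |r| &gt; 0.1 threshold below points to a standard correlation coefficient; a stdlib sketch of that selection rule, assuming Pearson's r and with illustrative names:</p>

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation coefficient between a feature and the target."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def select_features(features, target, threshold=0.1):
    """Keep features whose absolute correlation with the target exceeds the
    threshold; a negative r is kept too, as it signals inverse proportionality."""
    return [name for name, values in features.items()
            if abs(pearson_r(values, target)) > threshold]

target = [1.0, 2.0, 3.0, 4.0, 5.0]
features = {
    "pm1":      [1.1, 2.0, 2.9, 4.2, 5.1],   # strongly correlated
    "pressure": [5.0, 4.0, 3.0, 2.0, 1.0],   # perfectly anti-correlated
    "noise":    [3.0, 1.0, 4.0, 1.0, 3.0],   # essentially uncorrelated
}
print(select_features(features, target))
```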
      <p>Consequently, even if a negative correlation with the
target variable is obtained using this formula, it remains
valuable as it signifies an inverse correlation, akin to
inverse proportionality. Ultimately, the features selected
by this method are those for which |r| &gt; 0.1.
3.3.2. Skewness Transformation
Skewness is a statistical measure that describes the
asymmetry of the probability distribution of a real-valued
random variable. In simpler terms, it measures the
degree and direction of skew (departure from horizontal
symmetry) in a dataset. A skewness value of 0 indicates
a perfectly symmetrical distribution, see Eq. 3. Positive
skewness indicates a longer right tail, while negative
skewness indicates a longer left tail.</p>
      <p>$\mathrm{Skewness} = \frac{n}{(n-1)(n-2)}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s}\right)^3$ (3)</p>
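      <p>Eq. 3 (the adjusted Fisher–Pearson sample skewness) and the skewness-reducing step can be sketched as follows; the log1p transform is one common choice, not necessarily the exact transformation set used in the paper:</p>

```python
from math import sqrt, log1p

def sample_skewness(xs):
    """Adjusted Fisher-Pearson skewness (Eq. 3):
    n / ((n-1)(n-2)) * sum(((x - mean) / s)^3), with s the sample std dev."""
    n = len(xs)
    mean = sum(xs) / n
    s = sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in xs)

def reduce_skew(xs, threshold=1.0):
    """Apply log1p to a non-negative feature when its skewness is high."""
    if abs(sample_skewness(xs)) > threshold:
        return [log1p(x) for x in xs]
    return xs

data = [1.0, 1.0, 2.0, 2.0, 3.0, 50.0]  # long right tail -> positive skew
print(sample_skewness(data) > 1.0)
print(sample_skewness(reduce_skew(data)) < sample_skewness(data))
```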
      <p>When dealing with regression problems, addressing highly skewed variables is crucial, as they can impact the model's fit. This is primarily due to the assumption of linearity made by most regression algorithms, which presupposes linear relationships between features. By applying transformations such as power or logarithmic functions, this effect can be mitigated, especially considering that the chosen model inherently possesses nonlinear properties. Additionally, highly skewed predictor variables can make the model overly sensitive to extremely high values, potentially resulting in a poor fit for the majority of the data. To tackle this issue, a skewness transformation was incorporated into the pipeline. This transformation applies a predefined set of transformations to each feature in order to reduce its skewness.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Results and Discussion</title>
      <p>By applying all the aforementioned techniques, the final pipeline is created and then trained on the preprocessed Turin dataset with the RDS split method. While the model reliably predicts where results should fall within their value range, it struggles to accurately forecast how they are distributed over time. Consequently, it can be inferred that the geographic location under study exerts a significant influence on PM forecasting.</p>
      <p>To tailor forecasting models to specific geographic
zones, it is essential to incorporate the studied area
as a feature or consider creating independent models
for each area under consideration. The challenge
faced by the model in this scenario may stem from
several factors, including the distinct nature of the
datasets, their unique contextual considerations, and the
temporal misalignment despite both datasets covering
an entire year. Furthermore, the placement of the SPS30 sensors within different devices for Southampton and Turin introduces significant variability in the collected data due to positional and rotational differences.</p>
      <p>To delve deeper into this issue, an additional test was performed by merging records from both the Southampton and Turin datasets. This merged dataset served as the comprehensive training and testing dataset with the RDS split and was subsequently processed through the aforementioned pipeline. The objective of this test was to develop a model capable of addressing both challenges simultaneously, by incorporating data from both geographical areas concurrently.</p>
      <p>As we can see from the results in Table 6, this test provided surprisingly good results all across the board, with great values both in the distance metrics and in R2. However, upon analyzing the Bland-Altman plot in Figure 4, it becomes apparent that there exist relatively high absolute differences between the predicted and actual values, particularly within the first range of values, where the majority of records are concentrated. This discrepancy implies that while the predictions generally fall within the desired range considering the wide scope of values (over 87k records), the model's precision in predicting exact values is suboptimal.</p>
      <p>One possible explanation for this phenomenon is the variability of PM values across different geographical areas, attributable to diverse environmental conditions. Without incorporating a feature that delineates between the two areas, the model treats the PM range as a unified domain for both datasets, endeavouring to predict within that domain without differentiation due to the absence of pertinent information. These findings underscore the original hypothesis, emphasizing the necessity to either incorporate features that encapsulate environmental conditions or devise distinct models for different areas, as the available features alone are insufficient to infer such information.</p>
      <p>To conclude this discussion and affirm the thesis, a final test was conducted by creating a new independent model using only the Southampton data. The results presented in Table 7 serve to reinforce the thesis that tailoring a model to a specific geographical area yields superior outcomes in accurately capturing and predicting PM levels using machine learning techniques. The model trained exclusively on Southampton data demonstrates excellent performance across all metrics utilized, consolidating the argument for geographic specialization in PM forecasting models.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In conclusion, this paper presents a comprehensive study
on the development of the LightGBM model for
predicting PM levels, highlighting the crucial role of
geographical considerations in the process. The study evaluates
various dataset split techniques and identifies the RDS
method as the most effective. The learning pipeline
encompasses feature selection and skewness
transformation. Remarkably, this pipeline achieves state-of-the-art
results on both the Turin and Southampton datasets
independently.</p>
      <p>Furthermore, a comparative analysis is conducted on
different combinations of data, as well as a merged dataset
test incorporating data from both regions simultaneously.</p>
      <p>However, the findings suggest that creating independent
models for distinct geographical areas yields the best
performance for this case study, underscoring the
significance of environmental conditions surrounding the
utilized sensor.</p>
      <p>This research endeavours to lay the groundwork for constructing models capable of generalizing while taking localized environmental factors into account in the predictive modelling of PM levels.</p>
      <p>Table 7: Metrics for the model trained exclusively on Southampton data — MAE 1.73, RMSE 3.04, MdAE 1.01, R2 0.88.</p>
      <p>References
[1] K. R. Daellenbach, G. Uzu, J. Jiang, L.-E. Cassagnes, Z. Leni, A. Vlachou, G. Stefenelli, F. Canonaco, S. Weber, A. Segers, J. J. P. Kuenen, M. Schaap, O. Favez, A. Albinet, S. Aksoyoglu, J. Dommen, U. Baltensperger, M. Geiser, I. El Haddad, J.-L. Jaffrezo, A. S. H. Prévôt, Sources of particulate-matter air pollution and its oxidative potential in Europe, Nature 587 (2020) 414–419. doi:10.1038/s41586-020-2902-8.
[2] A. Mukherjee, M. Agrawal, World air particulate matter: sources, distribution and health effects, Environmental Chemistry Letters 15 (2017) 283–309. doi:10.1007/s10311-017-0611-9.
[3] X. Yue, Y. Hu, C. Tian, R. Xu, W. Yu, Y. Guo, Increasing impacts of fire air pollution on public and ecosystem health, The Innovation 5 (2024) 100609.
[4] D. Grantz, J. Garner, D. Johnson, Ecological effects of particulate matter, Environment International 29 (2003) 213–239. doi:10.1016/S0160-4120(02)00181-2.
[5] M. J. Mohammadi, B. F. Dehaghi, S. Mansourimoghadam, A. Sharhani, P. Amini, S. Ghanbari, Cardiovascular disease, mortality and exposure to particulate matter (PM): a systematic review and meta-analysis, Reviews on Environmental Health 39 (2024) 141–149. doi:10.1515/reveh-2022-0090.
[6] M. Casari, L. Po, L. Zini, Low-cost PM data, 2023. URL: https://doi.org/10.5281/zenodo.10037781. doi:10.5281/zenodo.10037781.
[7] F. M. J. Bulot, Characterisation and calibration of low-cost PM sensors at high temporal resolution to reference grade performances dataset, 2022. URL: https://doi.org/10.5281/zenodo.7198378. doi:10.5281/zenodo.7198378.
[8] M. Casari, L. Po, MITH: A framework for mitigating hygroscopicity in low-cost PM sensors, Environmental Modelling &amp; Software 173 (2024) 105955. doi:10.1016/j.envsoft.2024.105955.
[9] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, T.-Y. Liu, LightGBM: A highly efficient gradient boosting decision tree, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 30, Curran Associates, Inc., 2017.
[10] T. Chen, C. Guestrin, XGBoost: A scalable tree boosting system, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, Association for Computing Machinery, New York, NY, USA, 2016, pp. 785–794. doi:10.1145/2939672.2939785.</p>
      <p>A. Online Resources
The Turin dataset used in this study is freely available through the Zenodo platform [6].</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>