A Comparative Study of LightGBM on Air Quality Data Across Multiple Locations

Martina Casari¹·*, Andrea Arigliano¹ and Laura Po¹
¹Department of Engineering "Enzo Ferrari", University of Modena and Reggio Emilia

Abstract
In this paper, we present a novel approach utilizing LightGBM algorithms to estimate PM2.5 concentrations in two distinct geographical locations, Turin in Italy and Southampton in the UK. Our methodology integrates data from low-cost sensors co-located with reference stations in both locations, ensuring data reliability. Through a rigorous analysis encompassing diverse splitting techniques, learning pipeline components, and feature selection methods, our approach shows remarkable performance across various scenarios, promising practical applicability. We initially train and test our model on the Turin dataset, followed by an assessment of its performance within that specific geographical context. We then extend our investigation to the Southampton dataset without any adjustments, revealing disparities in performance. Additionally, we conduct comparative training on both datasets, offering insights into the contextual factors influencing model efficacy within specific geographical areas. Our findings underscore the importance of contextual considerations for accurate air quality estimation and highlight the potential of our approach for real-world deployment. The datasets used in this study are publicly available, facilitating further research and validation.

Keywords
Particulate Matter, Low-cost sensors, Different Locations, LightGBM, Open dataset

1. Introduction

Airborne particulate matter (PM) refers to tiny particles in the air that can be composed of various materials such as dust, dirt, soot, smoke, and liquid droplets. These particles vary in size and can have different chemical compositions, originating from both natural and human-made sources [1]. Airborne PM consists of a heterogeneous mixture of solid and liquid particles suspended in air that varies continuously in size and chemical composition in space and time. PM is categorized based on the diameter of the particles, measured in micrometres (μm) [2]. The main classifications include PM1, PM2.5, PM4, and PM10, representing different size fractions, each of which causes different problems regarding both environmental conditions, affecting ecosystems [3, 4], and human health [5], with complications that mainly impact the respiratory and cardiovascular systems and can also affect the bloodstream.

Airborne PM can have severe environmental consequences. When it settles on the soil, it can have a detrimental impact on the nutrient cycling of plants and disrupt the ecosystem's balance. This can potentially lead to negative consequences for the entire food chain and have long-lasting effects on the environment.

When it comes to health concerns, much attention has been given to the amount of PM that enters a person's body, which is referred to as the dose. Studies examining this dose have shown that exposure to high concentrations of PM can lead to damage at the cellular level, particularly in the lungs. Several possible reactions can occur in response to certain environmental and chemical exposures. One hypothesis is that the body may up-regulate its production of antioxidant enzymes to combat the negative effects of these exposures. In some cases, exposure can also result in cell death or an allergic immune response. Additionally, exposure can impair the body's ability to defend the lungs and cause DNA damage. These effects can also ripple throughout the body, impacting other systems such as the cardiovascular system.

Given the detrimental impact that PM concentrations in the atmosphere can have, accurately forecasting future PM levels based on current air conditions is a critical undertaking. This effort is essential for preventing the various issues associated with PM exposure and for implementing effective countermeasures such as traffic and mobility restrictions.
This can potentially ness of the LightGBM algorithm in accurately forecast- lead to negative consequences on the entire food chain ing PM2.5 levels using cost-effective sensors and various and have long-lasting effects on the environment. When environmental parameters. Additionally, the study ex- it comes to health concerns, much attention has been plores the applicability of the method across different given to the amount of PM that enters a person’s body, locations, examining both homogeneous and heteroge- neous approaches. The training process relies on PM2.5 Ital-IA 2024: 4th National Conference on Artificial Intelligence, orga- measurements from reference stations, enabling the resul- nized by CINI, May 29-30, 2024, Naples, Italy ∗ Corresponding author. tant model to predict and adjust measurement readings Envelope-Open martina.casari@unimore.it (M. Casari); laura.po@unimore.it effectively. (L. Po) The article is structured as follows: Section 2 introduces Orcid 0000-0003-0406-3036 (M. Casari); 0000-0002-3345-176X (L. Po) the dataset; Section 3 outlines the methodology, includ- © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings ing the models used and the pipeline implemented; Sec- tion 4 presents the results and discussion; and Section 5 provides the conclusions. 2. Datasets The datasets considered in this study are created by a collection of measurements captured in two different geographical areas, both by using SPS30 low-cost (LC) sensors as input and the co-located legal stations as ref- erence: • Turin (Italy): LC sensors capturing records with 15-minute frequency, reference station (RS) with hourly frequency based on Arpa weather stations [6]; • Southampton (UK): LC sensors capturing records Figure 1: Correlation matrix of all the features. 
with 2 minutes frequency, RS sensors with hourly frequency based on Fidas200s weather stations [7]. 3. Methodology The data was obtained through individual sensor The research consisted of a methodical process with dis- measurements, which were then used to construct tinct stages. Firstly, a brute-force testing procedure was the raw datasets for both Turin and Southampton. carried out to determine the most appropriate machine- Subsequently, a thorough analysis of the LC and RS data learning model from a variety of options. Subsequently, was conducted to create a dataset linking each reference the pipeline was created by examining the ideal dataset record with a low-cost measurement. To achieve this, split, feature selection, and transformation techniques re- the input datasets were resampled to match the hourly quired for the specific task. Lastly, a thorough evaluation frequency of the reference datasets. of performance metrics was conducted using the Turin Initially, the resampling technique employed was dataset, including MAE, MSE, MdAE, and R2 metrics. averaging all the LC data over the RS hourly record. However, due to significant variations in the data within an hour, it was decided to assign the closest available 3.1. Model LC record to each RS record instead. After this process, The first step was to determine the appropriate model the raw datasets for both Turin and Southampton were for the problem at hand. To accomplish this, a Bulk Re- created, and preprocessing techniques [8] were applied gressor was implemented. This function tests a variety to uniformly adjust the data, preparing them for the of regression models from popular Python libraries, such training step. In the performance evaluation, just the as scikitlearn, on the target dataset, ultimately produc- preprocessed dataset was considered for comparison. ing a ranking of the most successful models based on Incorporating contextual features based on time into average prediction accuracy metrics. 
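The Bulk Regressor itself is not detailed in the paper; the following is a minimal sketch of the idea, assuming scikit-learn estimators and a single MAE-based ranking. The function name and the model shortlist are illustrative, not the authors' implementation.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

def bulk_regressor(X, y, models=None):
    """Fit several regressors and rank them by held-out MAE (lower is better)."""
    if models is None:
        models = {
            "linear": LinearRegression(),
            "random_forest": RandomForestRegressor(random_state=0),
            "gradient_boosting": GradientBoostingRegressor(random_state=0),
        }
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
    scores = {}
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        scores[name] = mean_absolute_error(y_te, model.predict(X_te))
    # Best (lowest-error) model first
    return sorted(scores.items(), key=lambda kv: kv[1])

# Demo on synthetic data with a deliberately nonlinear target
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(300, 4))
y = np.sin(4.0 * X[:, 0]) + X[:, 1] ** 2 + 0.05 * rng.normal(size=300)
ranking = bulk_regressor(X, y)
```

On such nonlinear data, the tree-based ensembles would be expected to outrank the linear baseline, mirroring the selection outcome described in the text.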
Interestingly, the top-performing models were nonlinear, indicating that interpreting the features required an examination of nonlinear relationships between them. As a result, LightGBM was chosen as the model for this study.

LightGBM (Light Gradient Boosting Machine) is a powerful and efficient gradient-boosting framework developed by Microsoft researchers in 2017 [9]. It is designed to be efficient and scalable, making it particularly well-suited for large datasets and high-dimensional feature spaces. It builds an ensemble of weak learners (decision trees) sequentially to minimize the overall prediction error, ultimately combining multiple weak models into a strong predictive model. Unlike the depth-wise tree growth of traditional gradient-boosting frameworks such as XGBoost [10], LightGBM adopts a leaf-wise tree growth strategy that chooses the leaf with the maximum delta loss to grow, which can lead to faster convergence and reduced computational cost. The trees are then used as usual, choosing the split that maximizes the information gain, evaluated via the variance score of each node. Other characteristics are that it performs feature selection internally and that the loss function typically used is the Mean Squared Error (MSE), Eq. 1.

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \qquad (1)$$

3.2. Split Techniques

Different split configurations were tested in order to obtain the optimal one for this case study, starting from a simple random split and moving towards more complex splits based on the time period considered. The splits considered are:

• Random Total Split (RTS): random split among all the records in the domain of the whole dataset;
• Random Day Split (RDS): random split obtained by grouping all the records by day, then splitting randomly within the subdomain of each single day;
• Random Month Split (RMS): random split obtained by grouping all the records by month, then splitting randomly within the subdomain of each single month;
• Forecast Day Split (FDS): forecast split obtained by grouping all the records by day, then assigning the first 75% to the training set and the last 25% to the test set within the subdomain of each single day;
• Forecast Month Split (FMS): forecast split obtained by grouping all the records by month, then assigning the first 75% to the training set and the last 25% to the test set within the subdomain of each single month.

Every split considered kept a 75-25 ratio between the training and test sets, simply varying the domain considered and whether the records were picked randomly or sequentially. Each of the aforementioned split techniques was tested over the preprocessed Turin dataset to choose the best-performing split for the next steps.

Table 1: Dataset split with performance metrics over the preprocessed Turin dataset

  Split   MAE    RMSE   MdAE   R²
  RTS     5.19   7.24   3.96   0.73
  RDS     5.16   7.38   3.90   0.73
  RMS     5.21   7.25   3.97   0.73
  FDS     6.86   8.87   5.82   0.62
  FMS     5.91   8.58   4.18   0.58

Table 2: Correlation of features with the target variable

  Feature            Correlation
  pm1                 0.588402
  pm2p5               0.546849
  pm4                 0.521004
  pm10                0.511054
  temperature        -0.473645
  pressure            0.306547
  relative humidity   0.297370
  wind speed         -0.227094
  month              -0.159161
  day of the week    -0.014929
  hour               -0.006393

As can be inferred from the results in Table 1, the RTS seems to achieve the best results across the board, but since we are working with time series, the best choice is not to use this split, as it tends to overestimate performance given the nature of the data. Therefore, the split technique adopted in the next steps of this research is the RDS.
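As a sketch, the day-wise random split (RDS) adopted here could be implemented as follows, assuming the records live in a pandas DataFrame with a datetime index; the function name is illustrative, not the authors' code.

```python
import numpy as np
import pandas as pd

def random_day_split(df, train_frac=0.75, seed=0):
    """RDS: group records by calendar day, then randomly split each day 75/25."""
    rng = np.random.default_rng(seed)
    train_parts, test_parts = [], []
    for _, day in df.groupby(df.index.floor("D")):
        idx = rng.permutation(len(day))          # shuffle within the single day
        cut = int(len(day) * train_frac)
        train_parts.append(day.iloc[idx[:cut]])
        test_parts.append(day.iloc[idx[cut:]])
    return pd.concat(train_parts).sort_index(), pd.concat(test_parts).sort_index()

# Two days of hourly records (48 rows): each day contributes 18 train + 6 test rows
ts = pd.date_range("2023-01-01", periods=48, freq="60min")
df = pd.DataFrame({"pm2p5": np.arange(48.0)}, index=ts)
train, test = random_day_split(df)
```

The RTS variant is this with a single global shuffle, while the FDS variant replaces the permutation with the day's chronological order.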
3.3. Pipeline

After selecting the model and the dataset split method, the subsequent task involves determining the data preparation techniques required for this problem. The primary components of the data processing pipeline are feature selection and skewness transformation. It is unnecessary to scale the data given the characteristics of LightGBM.

3.3.1. Feature Selection

In this phase, the most representative features of the problem were extracted. Since there is a relatively low number of features to begin with, the selection was done using a simple correlation method, where the resulting features are the ones whose correlation with the target variable exceeds a chosen threshold. The Pearson Correlation Coefficient (r) was used to assess these correlations, as indicated by Equation 2.

$$r = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}\,\sqrt{n \sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2}} \qquad (2)$$

Even if a negative correlation with the target variable is obtained using this formula, it remains valuable, as it signifies an inverse correlation, akin to inverse proportionality. Ultimately, the features selected by this method are those for which |r| > 0.1. As evident from Table 2, "day of the week" and "hour" exhibit weak correlations with the target variable and fall below this threshold.

3.3.2. Skewness Transformation

Skewness is a statistical measure that describes the asymmetry of the probability distribution of a real-valued random variable. In simpler terms, it measures the degree and direction of skew (departure from horizontal symmetry) in a dataset; see Eq. 3. A skewness value of 0 indicates a perfectly symmetrical distribution, positive skewness indicates a longer right tail, and negative skewness indicates a longer left tail.

$$\mathrm{Skewness} = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s}\right)^3 \qquad (3)$$

When dealing with regression problems, addressing highly skewed variables is crucial, as they can impact the model's fit. This is primarily due to the assumption of linearity made by most regression algorithms, which presupposes linear relationships between features. By applying transformations such as power or logarithmic functions, this effect can be mitigated, especially considering that the chosen model inherently possesses nonlinear properties. Additionally, highly skewed predictor variables can make the model overly sensitive to extremely high values, potentially resulting in a poor fit for the majority of the data.

To tackle this issue, a skewness transformation was incorporated into the pipeline. This transformation applies a predefined set of transformations to each feature in order to reduce its skewness. The set of transformations includes:

• Logarithm: f_t = log(f);
• Exponential: f_t = e^f;
• Square Root: f_t = √f;
• Quantile: f_t = F⁻¹(f).

For each feature in the dataset, all transformations are tested, and the one selected is the transformation that brings the feature's skewness closest to 0.

An example of feature skewness transformation is depicted in Figure 2, illustrating the distribution of temperature data: the second panel demonstrates the attainment of a Gaussian-like distribution after applying the Quantile Transformer. Table 3 shows the best transformation found for each feature.

Figure 2: Distribution comparison with the skewness transformer. (a) Before; (b) After.

Table 3: Best transformation for each feature

  Feature             Best Transformation
  pm1                 Log Transformation
  pm2p5_x             Log Transformation
  pm2p5_y             QuantileTransformer
  pm4                 Log Transformation
  pm10                Log Transformation
  relative_humidity   QuantileTransformer
  temperature         QuantileTransformer
  wind_speed          QuantileTransformer
  pressure            QuantileTransformer
  month               QuantileTransformer
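The transformation search described in Section 3.3.2 can be sketched as follows. This is a hedged illustration rather than the authors' code: skewness is computed as in Eq. 3, the quantile step is a rank-based normal-scores mapping standing in for scikit-learn's QuantileTransformer, and the exponential candidate is omitted here for numerical safety; all names are illustrative.

```python
import numpy as np
from statistics import NormalDist

def sample_skewness(x):
    """Adjusted Fisher-Pearson sample skewness, as in Eq. 3."""
    n = len(x)
    s = x.std(ddof=1)
    return n / ((n - 1) * (n - 2)) * np.sum(((x - x.mean()) / s) ** 3)

def quantile_transform(x):
    """Map ranks to standard-normal quantiles (rank-based normal scores)."""
    ranks = x.argsort().argsort()
    u = (ranks + 0.5) / len(x)
    nd = NormalDist()
    return np.array([nd.inv_cdf(p) for p in u])

def best_transformation(x):
    """Try each candidate and keep the one whose output is closest to zero skew."""
    candidates = {
        "identity": lambda v: v,
        "log": lambda v: np.log(v) if (v > 0).all() else v,
        "sqrt": lambda v: np.sqrt(v) if (v >= 0).all() else v,
        "quantile": quantile_transform,
    }
    scores = {name: abs(sample_skewness(f(x))) for name, f in candidates.items()}
    return min(scores, key=scores.get)

# Right-skewed synthetic sample, loosely mimicking a PM concentration feature
rng = np.random.default_rng(0)
pm_like = rng.lognormal(mean=2.0, sigma=0.8, size=1000)
chosen = best_transformation(pm_like)
```

On such a right-skewed sample, both the logarithm and the quantile mapping drive the skewness toward zero, which is consistent with the winners reported in Table 3.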
4. Results and Discussion

By applying all the aforementioned techniques, the final pipeline is created and then trained on the preprocessed Turin dataset with the RDS split method.

Table 4: Performance metrics obtained from training the LightGBM model on the Turin preprocessed dataset

  Metric     Turin Train   Turin Test
  MAE        0.3023        0.3369
  RMSE       0.1467        0.1846
  MdAE       0.2508        0.2775
  R² Score   0.7435        0.6735

As evident from Table 4, the selected pipeline demonstrates strong performance on both the Turin training and test sets. In Figure 3, the feature importance ranking for the constructed model is depicted; the significance of the meteorological features for the model's predictions is notable.

Figure 3: Feature importance ranking.

The results presented in Table 5 highlight the performance achieved when applying the model to a distinct dataset, the Southampton dataset. Here, it is evident that the model's predictions are unsatisfactory. This suggests that while the model reliably predicts where results should fall within their value range, it struggles to accurately forecast how they are distributed over time. Consequently, it can be inferred that the geographic location under study exerts a significant influence on PM forecasting.

Table 5: Southampton (UK) performance metrics obtained from training the LightGBM model on the Turin preprocessed dataset

  Metric     UK Dataset
  MAE         6.4039
  RMSE       82.9644
  MdAE        4.5478
  R² Score   -0.9130

To tailor forecasting models to specific geographic zones, it is essential to incorporate the studied area as a feature or to consider creating independent models for each area under consideration. The challenge faced by the model in this scenario may stem from several factors, including the distinct nature of the datasets, their unique contextual considerations, and their temporal misalignment, despite both datasets covering an entire year. Furthermore, the placement of the SPS30 sensors within different devices in Southampton and Turin introduces significant variability in the collected data due to positional and rotational differences.

To delve deeper into this issue, an additional test was performed by merging records from both the Southampton and Turin datasets. This merged dataset served as the comprehensive training and testing dataset with the RDS split and was subsequently processed through the aforementioned pipeline. The objective of this test was to develop a model capable of addressing both challenges simultaneously, by incorporating data from both geographical areas concurrently. As we can see from the results in Table 6, this test provided surprisingly good results across the board, with strong values both in the distance metrics and in R².

Table 6: Performance metrics for the merged dataset

  Metric           MAE    RMSE   MdAE   R²
  Merged Dataset   3.52   5.78   2.08   0.78

However, upon analyzing the Bland-Altman plot in Figure 4, it becomes apparent that there exist relatively high absolute differences between the predicted and actual values, particularly within the first range of values, where the majority of records are concentrated. This discrepancy implies that while the predictions generally fall within the desired range considering the wide scope of values (over 87k records), the model's precision in predicting exact values is suboptimal.

Figure 4: Bland-Altman plot for the merged dataset.

One possible explanation for this phenomenon is the variability of PM values across different geographical areas, attributable to diverse environmental conditions. Without a feature that delineates between the two areas, the model treats the PM range as a unified domain for both datasets, endeavouring to predict within that domain without differentiation due to the absence of pertinent information. These findings underscore the original hypothesis, emphasizing the necessity to either incorporate features that encapsulate environmental conditions or devise distinct models for different areas, as the available features alone are insufficient to infer such information.

To conclude this discussion and affirm the thesis, a final test was conducted by creating a new independent model using only the Southampton data. The results presented in Table 7 reinforce the thesis that tailoring a model to a specific geographical area yields superior outcomes in accurately capturing and predicting PM levels using machine learning techniques. The model trained exclusively on Southampton data demonstrates excellent performance across all the metrics utilized, consolidating the argument for geographic specialization in PM forecasting models.

Table 7: Performance metrics for the Southampton model

  Metric              MAE    RMSE   MdAE   R²
  Southampton Model   1.73   3.04   1.01   0.88

5. Conclusion

In conclusion, this paper presents a comprehensive study on the development of the LightGBM model for predicting PM levels, highlighting the crucial role of geographical considerations in the process. The study evaluates various dataset split techniques and identifies the RDS method as the most effective. The learning pipeline encompasses feature selection and skewness transformation. Remarkably, this pipeline achieves state-of-the-art results on both the Turin and Southampton datasets independently. Furthermore, a comparative analysis is conducted on different combinations of data, as well as a merged-dataset test incorporating data from both regions simultaneously. However, the findings suggest that creating independent models for distinct geographical areas yields the best performance for this case study, underscoring the significance of the environmental conditions surrounding the utilized sensor. This research aims to lay the groundwork for constructing models capable of generalizing while taking localized environmental factors into account in the predictive modelling of PM levels.

References

[1] K. R. Daellenbach, G. Uzu, J. Jiang, L.-E. Cassagnes, Z. Leni, A. Vlachou, G. Stefenelli, F. Canonaco, S. Weber, A. Segers, J. J. P. Kuenen, M. Schaap, O. Favez, A. Albinet, S. Aksoyoglu, J. Dommen, U. Baltensperger, M. Geiser, I. El Haddad, J.-L. Jaffrezo, A. S. H. Prévôt, Sources of particulate-matter air pollution and its oxidative potential in Europe, Nature 587 (2020) 414–419. doi:10.1038/s41586-020-2902-8.
[2] A. Mukherjee, M. Agrawal, World air particulate matter: sources, distribution and health effects, Environmental Chemistry Letters 15 (2017) 283–309. doi:10.1007/s10311-017-0611-9.
[3] X. Yue, Y. Hu, C. Tian, R. Xu, W. Yu, Y. Guo, Increasing impacts of fire air pollution on public and ecosystem health, The Innovation 5 (2024) 100609.
[4] D. Grantz, J. Garner, D. Johnson, Ecological effects of particulate matter, Environment International 29 (2003) 213–239. doi:10.1016/S0160-4120(02)00181-2.
[5] M. J. Mohammadi, B. F. Dehaghi, S. Mansourimoghadam, A. Sharhani, P. Amini, S. Ghanbari, Cardiovascular disease, mortality and exposure to particulate matter (PM): a systematic review and meta-analysis, Reviews on Environmental Health 39 (2024) 141–149. doi:10.1515/reveh-2022-0090.
[6] M. Casari, L. Po, L. Zini, Low-cost PM data, 2023. URL: https://doi.org/10.5281/zenodo.10037781. doi:10.5281/zenodo.10037781.
[7] F. M. J. Bulot, Characterisation and calibration of low-cost PM sensors at high temporal resolution to reference grade performances - dataset, 2022. URL: https://doi.org/10.5281/zenodo.7198378. doi:10.5281/zenodo.7198378.
[8] M. Casari, L. Po, MITH: A framework for mitigating hygroscopicity in low-cost PM sensors, Environmental Modelling & Software 173 (2024) 105955. doi:10.1016/j.envsoft.2024.105955.
[9] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, T.-Y. Liu, LightGBM: A highly efficient gradient boosting decision tree, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 30, Curran Associates, Inc., 2017.
[10] T. Chen, C. Guestrin, XGBoost: A scalable tree boosting system, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, Association for Computing Machinery, New York, NY, USA, 2016, pp. 785–794. doi:10.1145/2939672.2939785.

A. Online Resources

The Turin dataset used in this study is freely available through the Zenodo platform [6].