A Comparative Study of LightGBM on Air Quality Data Across Multiple Locations

Martina Casari¹·*, Andrea Arigliano¹ and Laura Po¹
¹Department of Engineering "Enzo Ferrari", University of Modena and Reggio Emilia

Abstract
In this paper, we present a novel approach utilizing LightGBM algorithms to estimate PM2.5 concentrations in two distinct geographical locations, Turin in Italy and Southampton in the UK. Our methodology integrates data from low-cost sensors co-located with reference stations in both locations, ensuring data reliability. Through a rigorous analysis encompassing diverse splitting techniques, learning pipeline components, and feature selection methods, our approach shows remarkable performance across various scenarios, promising practical applicability. We initially train and test our model on the Turin dataset, followed by an assessment of its performance within that specific geographical context. We then extend our investigation to the Southampton dataset without any adjustments, revealing disparities in performance. Additionally, we conduct comparative training on both datasets, offering insights into the contextual factors influencing model efficacy within specific geographical areas. Our findings underscore the importance of contextual considerations for accurate air quality estimation and highlight the potential of our approach for real-world deployment. The datasets used in this study are publicly available, facilitating further research and validation.

Keywords
Particulate Matter, Low-cost sensors, Different Locations, LightGBM, Open dataset

1. Introduction

Airborne particulate matter (PM) refers to tiny particles in the air that can be composed of various materials such as dust, dirt, soot, smoke, and liquid droplets. These particles vary in size and can have different chemical compositions, originating from both natural and human-made sources [1]. Airborne PM consists of a heterogeneous mixture of solid and liquid particles suspended in air that varies continuously in size and chemical composition in space and time. PM is categorized based on the diameter of the particles, measured in micrometres (μm) [2]. The main classifications include PM1, PM2.5, PM4, and PM10, representing different size fractions, each of which causes different problems regarding both environmental conditions, affecting ecosystems [3, 4], and human health [5], with complications that mainly impact the respiratory and cardiovascular systems and can also affect the bloodstream.

Airborne PM can have severe environmental consequences. When it settles on the soil, it can have a detrimental impact on the nutrient cycling of plants and disrupt the ecosystem's balance. This can potentially lead to negative consequences for the entire food chain and have long-lasting effects on the environment.

When it comes to health concerns, much attention has been given to the amount of PM that enters a person's body, which is referred to as the dose. Studies examining this dose have shown that exposure to high concentrations of PM can lead to damage at the cellular level, particularly in the lungs. Several possible reactions can occur in response to certain environmental and chemical exposures. One hypothesis is that the body may up-regulate its production of antioxidant enzymes to combat the negative effects of these exposures. In some cases, exposure can also result in cell death or an allergic immune response. Additionally, exposure can impair the body's ability to defend the lungs and cause DNA damage. These effects can also ripple throughout the body, impacting other systems such as the cardiovascular system.

Given the detrimental impact that PM concentrations in the atmosphere can have, accurately forecasting future PM levels based on current air conditions is a critical undertaking. This effort is essential for preventing the various issues associated with PM exposure and for implementing effective countermeasures such as traffic and mobility restrictions.
This can potentially ness of the LightGBM algorithm in accurately forecast- lead to negative consequences on the entire food chain ing PM2.5 levels using cost-effective sensors and various and have long-lasting effects on the environment. When environmental parameters. Additionally, the study ex- it comes to health concerns, much attention has been plores the applicability of the method across different given to the amount of PM that enters a person’s body, locations, examining both homogeneous and heteroge- neous approaches. The training process relies on PM2.5 Ital-IA 2024: 4th National Conference on Artificial Intelligence, orga- measurements from reference stations, enabling the resul- nized by CINI, May 29-30, 2024, Naples, Italy ∗ Corresponding author. tant model to predict and adjust measurement readings Envelope-Open martina.casari@unimore.it (M. Casari); laura.po@unimore.it effectively. (L. Po) The article is structured as follows: Section 2 introduces Orcid 0000-0003-0406-3036 (M. Casari); 0000-0002-3345-176X (L. Po) the dataset; Section 3 outlines the methodology, includ- © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings ing the models used and the pipeline implemented; Sec- tion 4 presents the results and discussion; and Section 5 provides the conclusions. 2. Datasets The datasets considered in this study are created by a collection of measurements captured in two different geographical areas, both by using SPS30 low-cost (LC) sensors as input and the co-located legal stations as ref- erence: • Turin (Italy): LC sensors capturing records with 15-minute frequency, reference station (RS) with hourly frequency based on Arpa weather stations [6]; • Southampton (UK): LC sensors capturing records Figure 1: Correlation matrix of all the features. 
with 2 minutes frequency, RS sensors with hourly frequency based on Fidas200s weather stations [7]. 3. Methodology The data was obtained through individual sensor The research consisted of a methodical process with dis- measurements, which were then used to construct tinct stages. Firstly, a brute-force testing procedure was the raw datasets for both Turin and Southampton. carried out to determine the most appropriate machine- Subsequently, a thorough analysis of the LC and RS data learning model from a variety of options. Subsequently, was conducted to create a dataset linking each reference the pipeline was created by examining the ideal dataset record with a low-cost measurement. To achieve this, split, feature selection, and transformation techniques re- the input datasets were resampled to match the hourly quired for the specific task. Lastly, a thorough evaluation frequency of the reference datasets. of performance metrics was conducted using the Turin Initially, the resampling technique employed was dataset, including MAE, MSE, MdAE, and R2 metrics. averaging all the LC data over the RS hourly record. However, due to significant variations in the data within an hour, it was decided to assign the closest available 3.1. Model LC record to each RS record instead. After this process, The first step was to determine the appropriate model the raw datasets for both Turin and Southampton were for the problem at hand. To accomplish this, a Bulk Re- created, and preprocessing techniques [8] were applied gressor was implemented. This function tests a variety to uniformly adjust the data, preparing them for the of regression models from popular Python libraries, such training step. In the performance evaluation, just the as scikitlearn, on the target dataset, ultimately produc- preprocessed dataset was considered for comparison. ing a ranking of the most successful models based on Incorporating contextual features based on time into average prediction accuracy metrics. 
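The Bulk Regressor itself is not detailed in the paper; the following is a minimal sketch of the idea, assuming scikit-learn estimators and a single MAE-based ranking. The function name and the model shortlist are illustrative, not the authors' implementation.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

def bulk_regressor(X, y, models=None):
    """Fit several regressors and rank them by held-out MAE (lower is better)."""
    if models is None:
        models = {
            "linear": LinearRegression(),
            "random_forest": RandomForestRegressor(random_state=0),
            "gradient_boosting": GradientBoostingRegressor(random_state=0),
        }
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
    scores = {}
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        scores[name] = mean_absolute_error(y_te, model.predict(X_te))
    # Best (lowest-error) model first
    return sorted(scores.items(), key=lambda kv: kv[1])

# Demo on synthetic data with a deliberately nonlinear target
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(300, 4))
y = np.sin(4.0 * X[:, 0]) + X[:, 1] ** 2 + 0.05 * rng.normal(size=300)
ranking = bulk_regressor(X, y)
```

On such nonlinear data, the tree-based ensembles would be expected to outrank the linear baseline, mirroring the selection outcome described in the text.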
Interestingly, the top-performing models were nonlinear, indicating that interpreting the features required an examination of nonlinear relationships between them. As a result, LightGBM was chosen as the model for this study.

LightGBM (Light Gradient Boosting Machine) is a powerful and efficient gradient-boosting framework developed by Microsoft researchers in 2017 [9]. It is designed to be efficient and scalable, making it particularly well-suited for large datasets and high-dimensional feature spaces. It builds an ensemble of weak learners (decision trees) sequentially to minimize the overall prediction error, ultimately combining multiple weak models into a strong predictive model. Unlike the depth-wise tree growth of traditional gradient-boosting frameworks such as XGBoost [10], LightGBM adopts a leaf-wise tree growth strategy that chooses the leaf with the maximum delta loss to grow, which can lead to faster convergence and reduced computational cost. The trees are then used as usual, choosing the split that maximizes the information gain, evaluated via the variance score of each node. Other characteristics are that it performs feature selection internally and that the loss function typically used is the Mean Squared Error (MSE), Eq. 1.

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \qquad (1)$$

3.2. Split Techniques

Different split configurations were tested in order to obtain the optimal one for this case study, starting from a simple random split and moving towards more complex splits based on the time period considered. The splits considered are:

• Random Total Split (RTS): random split among all the records in the domain of the whole dataset;
• Random Day Split (RDS): random split obtained by grouping all the records by day, then splitting randomly within the subdomain of each single day;
• Random Month Split (RMS): random split obtained by grouping all the records by month, then splitting randomly within the subdomain of each single month;
• Forecast Day Split (FDS): forecast split obtained by grouping all the records by day, then assigning the first 75% to the training set and the last 25% to the test set within the subdomain of each single day;
• Forecast Month Split (FMS): forecast split obtained by grouping all the records by month, then assigning the first 75% to the training set and the last 25% to the test set within the subdomain of each single month.

Every split considered kept a 75-25 ratio between the training and test sets, simply varying the domain considered and whether the records were picked randomly or sequentially. Each of the aforementioned split techniques was tested over the preprocessed Turin dataset to choose the best-performing split for the next steps.

Table 1: Dataset split with performance metrics over the preprocessed Turin dataset

  Split   MAE    RMSE   MdAE   R²
  RTS     5.19   7.24   3.96   0.73
  RDS     5.16   7.38   3.90   0.73
  RMS     5.21   7.25   3.97   0.73
  FDS     6.86   8.87   5.82   0.62
  FMS     5.91   8.58   4.18   0.58

Table 2: Correlation of features with the target variable

  Feature            Correlation
  pm1                 0.588402
  pm2p5               0.546849
  pm4                 0.521004
  pm10                0.511054
  temperature        -0.473645
  pressure            0.306547
  relative humidity   0.297370
  wind speed         -0.227094
  month              -0.159161
  day of the week    -0.014929
  hour               -0.006393

As can be inferred from the results in Table 1, the RTS seems to achieve the best results across the board, but since we are working with time series, the best choice is not to use this split, as it tends to overestimate performance given the nature of the data. Therefore, the split technique adopted in the next steps of this research is the RDS.
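As a sketch, the day-wise random split (RDS) adopted here could be implemented as follows, assuming the records live in a pandas DataFrame with a datetime index; the function name is illustrative, not the authors' code.

```python
import numpy as np
import pandas as pd

def random_day_split(df, train_frac=0.75, seed=0):
    """RDS: group records by calendar day, then randomly split each day 75/25."""
    rng = np.random.default_rng(seed)
    train_parts, test_parts = [], []
    for _, day in df.groupby(df.index.floor("D")):
        idx = rng.permutation(len(day))          # shuffle within the single day
        cut = int(len(day) * train_frac)
        train_parts.append(day.iloc[idx[:cut]])
        test_parts.append(day.iloc[idx[cut:]])
    return pd.concat(train_parts).sort_index(), pd.concat(test_parts).sort_index()

# Two days of hourly records (48 rows): each day contributes 18 train + 6 test rows
ts = pd.date_range("2023-01-01", periods=48, freq="60min")
df = pd.DataFrame({"pm2p5": np.arange(48.0)}, index=ts)
train, test = random_day_split(df)
```

The RTS variant is this with a single global shuffle, while the FDS variant replaces the permutation with the day's chronological order.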
3.3. Pipeline

After selecting the model and the dataset split method, the subsequent task involves determining the data preparation techniques required for this problem. The primary components of the data processing pipeline are feature selection and skewness transformation. It is unnecessary to scale the data given the characteristics of LightGBM.

3.3.1. Feature Selection

In this phase, the most representative features of the problem were extracted. Since there is a relatively low number of features to begin with, the selection was done using a simple correlation method, where the resulting features are the ones whose correlation with the target variable exceeds a chosen threshold. The Pearson Correlation Coefficient (r) was used to assess these correlations, as indicated by Equation 2.

$$r = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}\,\sqrt{n \sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2}} \qquad (2)$$

Even if a negative correlation with the target variable is obtained using this formula, it remains valuable, as it signifies an inverse correlation, akin to inverse proportionality. Ultimately, the features selected by this method are those for which |r| > 0.1. As evident from Table 2, "day of the week" and "hour" exhibit weak correlations with the target variable and fall below this threshold.

3.3.2. Skewness Transformation

Skewness is a statistical measure that describes the asymmetry of the probability distribution of a real-valued random variable. In simpler terms, it measures the degree and direction of skew (departure from horizontal symmetry) in a dataset; see Eq. 3. A skewness value of 0 indicates a perfectly symmetrical distribution, positive skewness indicates a longer right tail, and negative skewness indicates a longer left tail.

$$\mathrm{Skewness} = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s}\right)^3 \qquad (3)$$

When dealing with regression problems, addressing highly skewed variables is crucial, as they can impact the model's fit. This is primarily due to the assumption of linearity made by most regression algorithms, which presupposes linear relationships between features. By applying transformations such as power or logarithmic functions, this effect can be mitigated, especially considering that the chosen model inherently possesses nonlinear properties. Additionally, highly skewed predictor variables can make the model overly sensitive to extremely high values, potentially resulting in a poor fit for the majority of the data.

To tackle this issue, a skewness transformation was incorporated into the pipeline. This transformation applies a predefined set of transformations to each feature in order to reduce its skewness. The set of transformations includes:

• Logarithm: f_t = log(f);
• Exponential: f_t = e^f;
• Square Root: f_t = √f;
• Quantile: f_t = F⁻¹(f).

For each feature in the dataset, all transformations are tested, and the one selected is the transformation that brings the feature's skewness closest to 0.

An example of feature skewness transformation is depicted in Figure 2, illustrating the distribution of temperature data: the second panel demonstrates the attainment of a Gaussian-like distribution after applying the Quantile Transformer. Table 3 shows the best transformation found for each feature.

Figure 2: Distribution comparison with the skewness transformer. (a) Before; (b) After.

Table 3: Best transformation for each feature

  Feature             Best Transformation
  pm1                 Log Transformation
  pm2p5_x             Log Transformation
  pm2p5_y             QuantileTransformer
  pm4                 Log Transformation
  pm10                Log Transformation
  relative_humidity   QuantileTransformer
  temperature         QuantileTransformer
  wind_speed          QuantileTransformer
  pressure            QuantileTransformer
  month               QuantileTransformer
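The transformation search described in Section 3.3.2 can be sketched as follows. This is a hedged illustration rather than the authors' code: skewness is computed as in Eq. 3, the quantile step is a rank-based normal-scores mapping standing in for scikit-learn's QuantileTransformer, and the exponential candidate is omitted here for numerical safety; all names are illustrative.

```python
import numpy as np
from statistics import NormalDist

def sample_skewness(x):
    """Adjusted Fisher-Pearson sample skewness, as in Eq. 3."""
    n = len(x)
    s = x.std(ddof=1)
    return n / ((n - 1) * (n - 2)) * np.sum(((x - x.mean()) / s) ** 3)

def quantile_transform(x):
    """Map ranks to standard-normal quantiles (rank-based normal scores)."""
    ranks = x.argsort().argsort()
    u = (ranks + 0.5) / len(x)
    nd = NormalDist()
    return np.array([nd.inv_cdf(p) for p in u])

def best_transformation(x):
    """Try each candidate and keep the one whose output is closest to zero skew."""
    candidates = {
        "identity": lambda v: v,
        "log": lambda v: np.log(v) if (v > 0).all() else v,
        "sqrt": lambda v: np.sqrt(v) if (v >= 0).all() else v,
        "quantile": quantile_transform,
    }
    scores = {name: abs(sample_skewness(f(x))) for name, f in candidates.items()}
    return min(scores, key=scores.get)

# Right-skewed synthetic sample, loosely mimicking a PM concentration feature
rng = np.random.default_rng(0)
pm_like = rng.lognormal(mean=2.0, sigma=0.8, size=1000)
chosen = best_transformation(pm_like)
```

On such a right-skewed sample, both the logarithm and the quantile mapping drive the skewness toward zero, which is consistent with the winners reported in Table 3.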
4. Results and Discussion

By applying all the aforementioned techniques, the final pipeline is created and then trained on the preprocessed Turin dataset with the RDS split method.

Table 4: Performance metrics obtained from training the LightGBM model on the Turin preprocessed dataset

  Metric     Turin Train   Turin Test
  MAE        0.3023        0.3369
  RMSE       0.1467        0.1846
  MdAE       0.2508        0.2775
  R² Score   0.7435        0.6735

As evident from Table 4, the selected pipeline demonstrates strong performance on both the Turin training and test sets. In Figure 3, the feature importance ranking for the constructed model is depicted; the significance of the meteorological features for the model's predictions is notable.

Figure 3: Feature importance ranking.

The results presented in Table 5 highlight the performance achieved when applying the model to a distinct dataset, the Southampton dataset. Here, it is evident that the model's predictions are unsatisfactory. This suggests that while the model reliably predicts where results should fall within their value range, it struggles to accurately forecast how they are distributed over time. Consequently, it can be inferred that the geographic location under study exerts a significant influence on PM forecasting.

Table 5: Southampton (UK) performance metrics obtained from training the LightGBM model on the Turin preprocessed dataset

  Metric     UK Dataset
  MAE         6.4039
  RMSE       82.9644
  MdAE        4.5478
  R² Score   -0.9130

To tailor forecasting models to specific geographic zones, it is essential to incorporate the studied area as a feature or to consider creating independent models for each area under consideration. The challenge faced by the model in this scenario may stem from several factors, including the distinct nature of the datasets, their unique contextual considerations, and their temporal misalignment, despite both datasets covering an entire year. Furthermore, the placement of the SPS30 sensors within different devices in Southampton and Turin introduces significant variability in the collected data due to positional and rotational differences.

To delve deeper into this issue, an additional test was performed by merging records from both the Southampton and Turin datasets. This merged dataset served as the comprehensive training and testing dataset with the RDS split and was subsequently processed through the aforementioned pipeline. The objective of this test was to develop a model capable of addressing both challenges simultaneously, by incorporating data from both geographical areas concurrently. As we can see from the results in Table 6, this test provided surprisingly good results across the board, with strong values both in the distance metrics and in R².

Table 6: Performance metrics for the merged dataset

  Metric           MAE    RMSE   MdAE   R²
  Merged Dataset   3.52   5.78   2.08   0.78

However, upon analyzing the Bland-Altman plot in Figure 4, it becomes apparent that there exist relatively high absolute differences between the predicted and actual values, particularly within the first range of values, where the majority of records are concentrated. This discrepancy implies that while the predictions generally fall within the desired range considering the wide scope of values (over 87k records), the model's precision in predicting exact values is suboptimal.

Figure 4: Bland-Altman plot for the merged dataset.

One possible explanation for this phenomenon is the variability of PM values across different geographical areas, attributable to diverse environmental conditions. Without a feature that delineates between the two areas, the model treats the PM range as a unified domain for both datasets, endeavouring to predict within that domain without differentiation due to the absence of pertinent information. These findings underscore the original hypothesis, emphasizing the necessity to either incorporate features that encapsulate environmental conditions or devise distinct models for different areas, as the available features alone are insufficient to infer such information.

To conclude this discussion and affirm the thesis, a final test was conducted by creating a new independent model using only the Southampton data. The results presented in Table 7 reinforce the thesis that tailoring a model to a specific geographical area yields superior outcomes in accurately capturing and predicting PM levels using machine learning techniques. The model trained exclusively on Southampton data demonstrates excellent performance across all the metrics utilized, consolidating the argument for geographic specialization in PM forecasting models.

Table 7: Performance metrics for the Southampton model

  Metric              MAE    RMSE   MdAE   R²
  Southampton Model   1.73   3.04   1.01   0.88

5. Conclusion

In conclusion, this paper presents a comprehensive study on the development of the LightGBM model for predicting PM levels, highlighting the crucial role of geographical considerations in the process. The study evaluates various dataset split techniques and identifies the RDS method as the most effective. The learning pipeline encompasses feature selection and skewness transformation. Remarkably, this pipeline achieves state-of-the-art results on both the Turin and Southampton datasets independently. Furthermore, a comparative analysis is conducted on different combinations of data, as well as a merged-dataset test incorporating data from both regions simultaneously. However, the findings suggest that creating independent models for distinct geographical areas yields the best performance for this case study, underscoring the significance of the environmental conditions surrounding the utilized sensor. This research aims to lay the groundwork for constructing models capable of generalizing while taking localized environmental factors into account in the predictive modelling of PM levels.

References

[1] K. R. Daellenbach, G. Uzu, J. Jiang, L.-E. Cassagnes, Z. Leni, A. Vlachou, G. Stefenelli, F. Canonaco, S. Weber, A. Segers, J. J. P. Kuenen, M. Schaap, O. Favez, A. Albinet, S. Aksoyoglu, J. Dommen, U. Baltensperger, M. Geiser, I. El Haddad, J.-L. Jaffrezo, A. S. H. Prévôt, Sources of particulate-matter air pollution and its oxidative potential in Europe, Nature 587 (2020) 414–419. doi:10.1038/s41586-020-2902-8.
[2] A. Mukherjee, M. Agrawal, World air particulate matter: sources, distribution and health effects, Environmental Chemistry Letters 15 (2017) 283–309. doi:10.1007/s10311-017-0611-9.
[3] X. Yue, Y. Hu, C. Tian, R. Xu, W. Yu, Y. Guo, Increasing impacts of fire air pollution on public and ecosystem health, The Innovation 5 (2024) 100609.
[4] D. Grantz, J. Garner, D. Johnson, Ecological effects of particulate matter, Environment International 29 (2003) 213–239. doi:10.1016/S0160-4120(02)00181-2.
[5] M. J. Mohammadi, B. F. Dehaghi, S. Mansourimoghadam, A. Sharhani, P. Amini, S. Ghanbari, Cardiovascular disease, mortality and exposure to particulate matter (PM): a systematic review and meta-analysis, Reviews on Environmental Health 39 (2024) 141–149. doi:10.1515/reveh-2022-0090.
[6] M. Casari, L. Po, L. Zini, Low-cost PM data, 2023. URL: https://doi.org/10.5281/zenodo.10037781. doi:10.5281/zenodo.10037781.
[7] F. M. J. Bulot, Characterisation and calibration of low-cost PM sensors at high temporal resolution to reference grade performances - dataset, 2022. URL: https://doi.org/10.5281/zenodo.7198378. doi:10.5281/zenodo.7198378.
[8] M. Casari, L. Po, MITH: A framework for mitigating hygroscopicity in low-cost PM sensors, Environmental Modelling & Software 173 (2024) 105955. doi:10.1016/j.envsoft.2024.105955.
[9] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, T.-Y. Liu, LightGBM: A highly efficient gradient boosting decision tree, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 30, Curran Associates, Inc., 2017.
[10] T. Chen, C. Guestrin, XGBoost: A scalable tree boosting system, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, Association for Computing Machinery, New York, NY, USA, 2016, pp. 785–794. doi:10.1145/2939672.2939785.

A. Online Resources

The Turin dataset used in this study is freely available through the Zenodo platform [6].