=Paper=
{{Paper
|id=Vol-3762/557
|storemode=property
|title=A Comparative Study of LightGBM on Air Quality Data Across Multiple Locations
|pdfUrl=https://ceur-ws.org/Vol-3762/557.pdf
|volume=Vol-3762
|authors=Martina Casari,Laura Po,Andrea Arigliano
|dblpUrl=https://dblp.org/rec/conf/ital-ia/CasariPA24
}}
==A Comparative Study of LightGBM on Air Quality Data Across Multiple Locations==
Martina Casari1,∗, Andrea Arigliano1 and Laura Po1
1 Department of Engineering "Enzo Ferrari", University of Modena and Reggio Emilia
Abstract
In this paper, we present a novel approach utilizing LightGBM algorithms to estimate PM2.5 concentrations in two distinct
geographical locations, Turin in Italy and Southampton in the UK. Our methodology integrates data from low-cost sensors
co-located with reference stations in both locations, ensuring data reliability. Through a rigorous analysis encompassing
diverse splitting techniques, learning pipeline components, and feature selection methods, our approach showcases remarkable
performance across various scenarios, promising practical applicability. We initially train and test our model on the Turin
dataset, followed by an assessment of its performance within the specific geographical context. Furthermore, we extend our
investigation to the Southampton dataset without any adjustments, revealing disparities in performance. Additionally, we
conduct comparative training on both datasets, offering insights into contextual factors influencing model efficacy within
specific geographical areas. Our findings underscore the importance of contextual considerations for accurate air quality
estimation and highlight the potential of our approach for real-world deployment. The datasets used in this study are publicly
available, facilitating further research and validation.
Keywords
Particulate Matter, Low-cost sensors, Different Locations, LightGBM, Open dataset
1. Introduction

Airborne particulate matter (PM) refers to tiny particles in the air that can be composed of various materials such as dust, dirt, soot, smoke, and liquid droplets. These particles vary in size and can have different chemical compositions, originating from both natural and human-made sources [1]. Airborne PM consists of a heterogeneous mixture of solid and liquid particles suspended in air that varies continuously in size and chemical composition in space and time. PM is categorized based on the diameter of the particles, measured in micrometres (μm) [2]. The main classifications include PM1, PM2.5, PM4, and PM10, representing different size fractions, each of them causing different problems both for environmental conditions, affecting ecosystems [3, 4], and for human health [5], mainly through complications of the respiratory and cardiovascular systems, potentially also affecting the bloodstream. Airborne PM can have severe environmental consequences. When it settles on the soil, it can have a detrimental impact on the nutrient cycling of plants and disrupt the ecosystem's balance. This can potentially lead to negative consequences on the entire food chain and have long-lasting effects on the environment. When it comes to health concerns, much attention has been given to the amount of PM that enters a person's body, which is referred to as the dose. Studies examining this dose have shown that exposure to high concentrations of PM can lead to damage at the cellular level, particularly in the lungs. Several possible reactions can occur in response to certain environmental and chemical exposures. One hypothesis is that the body may up-regulate its production of antioxidant enzymes to combat the negative effects of these exposures. In some cases, exposure can also result in cell death or an allergic immune response. Additionally, exposure can impair the body's ability to defend the lungs and cause DNA damage. It is important to note that these effects can also have a ripple effect throughout the body, impacting other systems such as the cardiovascular system. Given the detrimental impact that PM concentration in the atmosphere can have, accurately forecasting future PM levels based on current air conditions is a critical undertaking. This effort is essential in preventing the various issues associated with PM exposure and in implementing effective countermeasures, such as traffic and mobility restrictions.

The objective of this study is to demonstrate the effectiveness of the LightGBM algorithm in accurately forecasting PM2.5 levels using cost-effective sensors and various environmental parameters. Additionally, the study explores the applicability of the method across different locations, examining both homogeneous and heterogeneous approaches. The training process relies on PM2.5 measurements from reference stations, enabling the resultant model to predict and adjust measurement readings effectively.

The article is structured as follows: Section 2 introduces the dataset; Section 3 outlines the methodology, including the models used and the pipeline implemented; Section 4 presents the results and discussion; and Section 5 provides the conclusions.

Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29-30, 2024, Naples, Italy
∗ Corresponding author: martina.casari@unimore.it (M. Casari); laura.po@unimore.it (L. Po)
ORCID: 0000-0003-0406-3036 (M. Casari); 0000-0002-3345-176X (L. Po)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
2. Datasets
The datasets considered in this study are created by a
collection of measurements captured in two different
geographical areas, both by using SPS30 low-cost (LC)
sensors as input and the co-located legal stations as ref-
erence:
• Turin (Italy): LC sensors capturing records with 15-minute frequency, reference station (RS) with hourly frequency based on Arpa weather stations [6];
• Southampton (UK): LC sensors capturing records with 2-minute frequency, RS sensors with hourly frequency based on Fidas200s weather stations [7].

Figure 1: Correlation matrix of all the features.
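Pairing streams with such different sampling rates is typically done by attaching, to each hourly reference record, the nearest low-cost record in time. A minimal sketch with pandas follows; the frame and column names are illustrative assumptions, not the authors' actual schema:

```python
import pandas as pd

def align_nearest(lc: pd.DataFrame, rs: pd.DataFrame) -> pd.DataFrame:
    """Attach the closest available LC record to each hourly RS record."""
    lc = lc.sort_values("timestamp")
    rs = rs.sort_values("timestamp")
    return pd.merge_asof(
        rs,
        lc,
        on="timestamp",
        direction="nearest",              # closest LC record, before or after
        tolerance=pd.Timedelta("30min"),  # drop matches further than half an hour
    )
```

With a 15-minute LC stream, every hourly RS record finds an LC reading at most 7.5 minutes away, so the tolerance mainly guards against sensor outages.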
The data was obtained through individual sensor measurements, which were then used to construct the raw datasets for both Turin and Southampton. Subsequently, a thorough analysis of the LC and RS data was conducted to create a dataset linking each reference record with a low-cost measurement. To achieve this, the input datasets were resampled to match the hourly frequency of the reference datasets.
Initially, the resampling technique employed was averaging all the LC data over the RS hourly record. However, due to significant variations in the data within an hour, it was decided to assign the closest available LC record to each RS record instead. After this process, the raw datasets for both Turin and Southampton were created, and preprocessing techniques [8] were applied to uniformly adjust the data, preparing them for the training step. In the performance evaluation, only the preprocessed dataset was considered for comparison.
Incorporating contextual features based on time into the feature extraction process has allowed for a more thorough understanding of the data. This approach not only captures the original features but also encodes information about the time axis, enabling a fine and accurate representation of patterns that unfold over time. Ultimately, this results in more insightful and precise outcomes.
The final set of features included in the datasets comprises "pm1", "pm2p5", "pm2p5 RF target", "pm4", "pm10", "wind speed", "pressure", "temperature", "relative humidity", "month", "day of the week", and "hour". The correlation matrix is depicted in Figure 1.

3. Methodology

The research consisted of a methodical process with distinct stages. Firstly, a brute-force testing procedure was carried out to determine the most appropriate machine-learning model from a variety of options. Subsequently, the pipeline was created by examining the ideal dataset split, feature selection, and transformation techniques required for the specific task. Lastly, a thorough evaluation of performance was conducted using the Turin dataset, including the MAE, MSE, MdAE, and R2 metrics.

3.1. Model

The first step was to determine the appropriate model for the problem at hand. To accomplish this, a Bulk Regressor was implemented. This function tests a variety of regression models from popular Python libraries, such as scikit-learn, on the target dataset, ultimately producing a ranking of the most successful models based on average prediction accuracy metrics. Interestingly, the top-performing models were nonlinear, indicating that interpreting the features required an examination of nonlinear relationships between them. As a result, LightGBM was chosen as the model for this study.
LightGBM (Light Gradient Boosting Machine) is a powerful and efficient gradient-boosting framework developed by Microsoft researchers in 2017 [9]. It is designed to be efficient and scalable, making it particularly well-suited for large datasets and high-dimensional feature spaces. It utilizes the boosting framework, building an ensemble of weak learners (decision trees) sequentially to minimize the overall prediction error, ultimately combining multiple weak models into a strong predictive model. Unlike the depth-first tree growth of traditional gradient-boosting frameworks such as XGBoost [10], LightGBM adopts a leaf-wise tree growth strategy that expands the leaf with the maximum delta loss, which can lead to faster convergence and reduced computational cost. The trees are then used as usual, choosing the path that maximizes the information gain, which is evaluated via the variance score of each node. Among its other characteristics, LightGBM includes a built-in feature selection process, and the loss usually used is the Mean Squared Error (MSE) loss, Eq. 1.

Table 1: Dataset split with performance metrics over the preprocessed Turin dataset

Split  MAE   RMSE  MdAE  R2
RTS    5.19  7.24  3.96  0.73
RDS    5.16  7.38  3.90  0.73
RMS    5.21  7.25  3.97  0.73
FDS    6.86  8.87  5.82  0.62
FMS    5.91  8.58  4.18  0.58
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2    (1)
3.2. Split Techniques

Different split configurations were tested in order to obtain the optimal one for this case study, starting from a simple random split and moving towards more complex splits based on the time period considered. The splits considered are:

• Random Total Split (RTS): random split among all the records in the domain of the whole dataset;
• Random Day Split (RDS): random split obtained by grouping all the records by day, then randomly splitting within the subdomain of the single day;
• Random Month Split (RMS): random split obtained by grouping all the records by month, then randomly splitting within the subdomain of the single month;
• Forecast Day Split (FDS): forecast split obtained by grouping all the records by day, then assigning the first 75% to the training set and the last 25% to the test set within the subdomain of the single day;
• Forecast Month Split (FMS): forecast split obtained by grouping all the records by month, then assigning the first 75% to the training set and the last 25% to the test set within the subdomain of the single month.

Every split considered kept a 75-25 ratio between the training and test set, simply varying the domain considered and whether the records were picked randomly or sequentially. Each of the aforementioned split techniques was tested over the preprocessed Turin dataset to choose the best-performing split for the next steps.
As can be inferred from the results in Table 1, the RTS seems to achieve the best results across the board, but since we are working with time series the best choice is not to consider this split, as it tends to overestimate the results due to the nature of the data. Therefore, the split technique considered in the next steps of this research is the RDS.

3.3. Pipeline

After selecting the model and dataset split method, the subsequent task involves determining the required data preparation techniques for this problem. The primary components of the data processing pipeline include feature selection and skewness transformation. It is unnecessary to scale the data given the characteristics of LightGBM.

3.3.1. Feature Selection

In this phase, the most representative features of the problem were extracted. Since there is a relatively low number of features to begin with, the selection was done using a simple correlation method, where the resulting features are the ones that correlate with the target variable higher than a chosen threshold.

Table 2: Correlation of features with target variable

Feature            Correlation (r)
pm1                 0.588402
pm2p5               0.546849
pm4                 0.521004
pm10                0.511054
temperature        -0.473645
pressure            0.306547
relative humidity   0.297370
wind speed         -0.227094
month              -0.159161
day of the week    -0.014929
hour               -0.006393

As evident from Table 2, both "day of the week" and "month" exhibit weak correlations with the target variable. The Pearson Correlation Coefficient (r) was utilized to assess these correlations, as indicated by Equation 2.

r = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2} \sqrt{n \sum_{i=1}^{n} y_i^2 - \left( \sum_{i=1}^{n} y_i \right)^2}}    (2)
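The threshold filter described in this subsection can be sketched as below; the column names are illustrative, and `pandas.DataFrame.corr` computes the Pearson coefficient of Eq. 2 by default:

```python
import pandas as pd

def select_by_correlation(df: pd.DataFrame, target: str, threshold: float = 0.1):
    """Keep features whose Pearson correlation with the target satisfies |r| > threshold."""
    r = df.corr(numeric_only=True)[target].drop(target)  # Pearson by default
    return r[r.abs() > threshold].index.tolist()
```

Thresholding the absolute value means an inversely correlated feature, such as temperature in Table 2, survives the filter.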
Consequently, even if a negative correlation with the target variable is obtained using this formula, it remains valuable, as it signifies an inverse correlation, akin to inverse proportionality. Ultimately, the features selected by this method are those for which |r| > 0.1.

3.3.2. Skewness Transformation

Skewness is a statistical measure that describes the asymmetry of the probability distribution of a real-valued random variable. In simpler terms, it measures the degree and direction of skew (departure from horizontal symmetry) in a dataset. A skewness value of 0 indicates a perfectly symmetrical distribution, see Eq. 3. Positive skewness indicates a longer right tail, while negative skewness indicates a longer left tail.

\mathrm{Skewness} = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^3    (3)

When dealing with regression problems, addressing highly skewed variables is crucial, as they can impact the model's fit. This is primarily due to the assumption of linearity made by most regression algorithms, which presupposes linear relationships between features. By applying transformations such as power or logarithmic functions, this effect can be mitigated, especially considering that the chosen model inherently possesses nonlinear properties.
Additionally, highly skewed predictor variables can make the model overly sensitive to extremely high values, potentially resulting in a poor fit for the majority of the data.
To tackle this issue, a skewness transformation was incorporated into the pipeline. This transformation applies a predefined set of transformations to each feature in order to reduce its skewness. The set of transformations includes:

• Logarithm: f_t = log(f);
• Exponential: f_t = e^f;
• Square Root: f_t = \sqrt{f};
• Quantile: f_t = F^{-1}(f).

For each feature in the dataset, all transformations are tested, and the one selected is the transformation that brings the feature's skewness closest to 0.
An example of a feature skewness transformation is depicted in Figure 2, illustrating the distribution of temperature data. The second panel demonstrates the attainment of a Gaussian distribution after applying the Quantile Transformer. Table 3 shows the best transformation found for each feature.

Table 3: Best transformation for each feature

Feature             Best Transformation
pm1                 Log Transformation
pm2p5_x             Log Transformation
pm2p5_y             QuantileTransformer
pm4                 Log Transformation
pm10                Log Transformation
relative_humidity   QuantileTransformer
temperature         QuantileTransformer
wind_speed          QuantileTransformer
pressure            QuantileTransformer
month               QuantileTransformer

Figure 2: Distribution comparison with skewness transformer: (a) before, (b) after.

4. Results and Discussion

By applying all the aforementioned techniques, the final pipeline is created and then trained on the preprocessed Turin dataset with the RDS split method.

Table 4: Performance metrics obtained from training the LightGBM model on the Turin preprocessed dataset

Metric     Turin Train   Turin Test
MAE        0.3023        0.3369
RMSE       0.1467        0.1846
MdAE       0.2508        0.2775
R2 Score   0.7435        0.6735

As evident from Table 4, the selected pipeline demonstrates strong performance on both the Turin training and test sets.
Figure 3 depicts the feature importance ranking for the constructed model. The significance of meteorological features for the model's predictions is notable.
The results presented in Table 5 highlight the performance achieved when applying the model to
a distinct dataset, the Southampton dataset. Here, it
is evident that the model’s prediction of outcomes is
unsatisfactory. This suggests that while the model
reliably predicts where results should fall within their value range, it struggles to accurately forecast how they are distributed over time. Consequently, it can be inferred that the geographic location under study exerts a significant influence on PM forecasting.

Table 5: Southampton (UK) performance metrics obtained from training the LightGBM model on the Turin preprocessed dataset

Metric     UK Dataset
MAE        6.4039
RMSE       82.9644
MdAE       4.5478
R2 Score   -0.9130

Figure 3: Feature importance ranking

To tailor forecasting models to specific geographic zones, it is essential to incorporate the studied area as a feature or to consider creating independent models for each area under consideration. The challenge faced by the model in this scenario may stem from several factors, including the distinct nature of the datasets, their unique contextual considerations, and their temporal misalignment, despite both datasets covering an entire year. Furthermore, the placement of the SPS30 sensors within different devices for Southampton and Turin introduces significant variability in the collected data due to positional and rotational differences.
To delve deeper into this issue, an additional test was performed by merging records from both the Southampton and Turin datasets. This merged dataset served as the comprehensive training and testing dataset with the RDS split and was subsequently processed through the aforementioned pipeline. The objective of this test was to develop a model capable of addressing both challenges simultaneously, by incorporating data from both geographical areas concurrently.
As we can see from the results in Table 6, this test provided surprisingly good results across the board, with strong values both in the distance metrics and in R2.

Table 6: Performance metrics for the merged dataset

Metric           MAE    RMSE   MdAE   R2
Merged Dataset   3.52   5.78   2.08   0.78

However, upon analyzing the Bland-Altman plot in Figure 4, it becomes apparent that there exist relatively high absolute differences between the predicted and actual values, particularly within the first range of values, where the majority of records are concentrated. This discrepancy implies that while the predictions generally fall within the desired range considering the wide scope of values (over 87k records), the model's precision in predicting exact values is suboptimal.

Figure 4: Bland-Altman plot for the merged dataset

One possible explanation for this phenomenon is the variability of PM values across different geographical areas, attributable to diverse environmental conditions. Without incorporating a feature that delineates between the two areas, the model treats the PM range as a unified domain for both datasets, endeavouring to predict within that domain without differentiation due to the absence of pertinent information. These findings underscore the original hypothesis, emphasizing the necessity to either incorporate features that encapsulate environmental conditions or devise distinct models for different areas, as the available features alone are insufficient to infer such information.
To conclude this discussion and affirm the thesis, a final test was conducted by creating a new independent model using only the Southampton data. The results presented in Table 7 reinforce the thesis that tailoring a model to a specific geographical area yields superior outcomes in accurately capturing and predicting PM levels using machine learning techniques.
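A Bland-Altman comparison like the one in Figure 4 can be reproduced with a short sketch; the function below (names are illustrative) plots the prediction-reference differences against their means, with the bias and its 1.96 SD limits of agreement:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

def bland_altman(pred, ref, ax=None):
    """Return (bias, lower, upper) limits of agreement and draw the plot."""
    pred = np.asarray(pred, dtype=float)
    ref = np.asarray(ref, dtype=float)
    mean = (pred + ref) / 2.0
    diff = pred - ref
    bias = diff.mean()
    sd = diff.std(ddof=1)
    lower, upper = bias - 1.96 * sd, bias + 1.96 * sd
    if ax is None:
        _, ax = plt.subplots()
    ax.scatter(mean, diff, s=10)
    for level in (lower, bias, upper):
        ax.axhline(level, linestyle="--")
    ax.set_xlabel("Mean of predicted and reference PM2.5")
    ax.set_ylabel("Predicted minus reference PM2.5")
    return bias, lower, upper
```

Points clustered far from the zero line at low mean values would reproduce the pattern discussed above for the merged dataset.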
Table 7: Performance metrics for Southampton model

Metric              MAE    RMSE   MdAE   R2
Southampton Model   1.73   3.04   1.01   0.88

The model trained exclusively on Southampton data demonstrates excellent performance across all metrics utilized, consolidating the argument for geographic specialization in PM forecasting models.

5. Conclusion

In conclusion, this paper presents a comprehensive study on the development of the LightGBM model for predicting PM levels, highlighting the crucial role of geographical considerations in the process. The study evaluates various dataset split techniques and identifies the RDS method as the most effective. The learning pipeline encompasses feature selection and skewness transformation. Remarkably, this pipeline achieves state-of-the-art results on both the Turin and Southampton datasets independently.
Furthermore, a comparative analysis is conducted on different combinations of data, as well as a merged dataset test incorporating data from both regions simultaneously. However, the findings suggest that creating independent models for distinct geographical areas yields the best performance for this case study, underscoring the significance of the environmental conditions surrounding the utilized sensor.
This research endeavours to lay the groundwork for constructing models capable of generalizing, taking into account localized environmental factors in the predictive modelling of PM levels.

References

[1] K. R. Daellenbach, G. Uzu, J. Jiang, L.-E. Cassagnes, Z. Leni, A. Vlachou, G. Stefenelli, F. Canonaco, S. Weber, A. Segers, J. J. P. Kuenen, M. Schaap, O. Favez, A. Albinet, S. Aksoyoglu, J. Dommen, U. Baltensperger, M. Geiser, I. El Haddad, J.-L. Jaffrezo, A. S. H. Prévôt, Sources of particulate-matter air pollution and its oxidative potential in Europe, Nature 587 (2020) 414-419. doi:10.1038/s41586-020-2902-8.
[2] A. Mukherjee, M. Agrawal, World air particulate matter: sources, distribution and health effects, Environmental Chemistry Letters 15 (2017) 283-309. doi:10.1007/s10311-017-0611-9.
[3] X. Yue, Y. Hu, C. Tian, R. Xu, W. Yu, Y. Guo, Increasing impacts of fire air pollution on public and ecosystem health, The Innovation 5 (2024) 100609.
[4] D. Grantz, J. Garner, D. Johnson, Ecological effects of particulate matter, Environment International 29 (2003) 213-239. doi:10.1016/S0160-4120(02)00181-2.
[5] M. J. Mohammadi, B. F. Dehaghi, S. Mansourimoghadam, A. Sharhani, P. Amini, S. Ghanbari, Cardiovascular disease, mortality and exposure to particulate matter (PM): a systematic review and meta-analysis, Reviews on Environmental Health 39 (2024) 141-149. doi:10.1515/reveh-2022-0090.
[6] M. Casari, L. Po, L. Zini, Low-cost PM data, 2023. URL: https://doi.org/10.5281/zenodo.10037781. doi:10.5281/zenodo.10037781.
[7] F. M. J. Bulot, Characterisation and calibration of low-cost PM sensors at high temporal resolution to reference grade performances - dataset, 2022. URL: https://doi.org/10.5281/zenodo.7198378. doi:10.5281/zenodo.7198378.
[8] M. Casari, L. Po, MITH: A framework for mitigating hygroscopicity in low-cost PM sensors, Environmental Modelling & Software 173 (2024) 105955. doi:10.1016/j.envsoft.2024.105955.
[9] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, T.-Y. Liu, LightGBM: A highly efficient gradient boosting decision tree, in: Advances in Neural Information Processing Systems, volume 30, Curran Associates, Inc., 2017.
[10] T. Chen, C. Guestrin, XGBoost: A scalable tree boosting system, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, Association for Computing Machinery, New York, NY, USA, 2016, pp. 785-794. doi:10.1145/2939672.2939785.

A. Online Resources

The Turin dataset used in this study is freely available through the Zenodo platform [6].