<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Comparative Study of LightGBM on Air Quality Data Across Multiple Locations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Martina Casari</string-name>
          <email>martina.casari@unimore.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Arigliano</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Laura Po</string-name>
          <email>laura.po@unimore.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Engineering "Enzo Ferrari", University of Modena and Reggio Emilia</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Particulate Matter</institution>
          ,
          <addr-line>Low-cost sensors, Different Locations, LightGBM, Open dataset</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <fpage>29</fpage>
      <lpage>30</lpage>
      <abstract>
        <p>In this paper, we present a novel approach utilizing LightGBM algorithms to estimate PM2.5 concentrations in two distinct geographical locations, Turin in Italy and Southampton in the UK. Our methodology integrates data from low-cost sensors co-located with reference stations in both locations, ensuring data reliability. Through a rigorous analysis encompassing diverse splitting techniques, learning pipeline components, and feature selection methods, our approach showcases remarkable performance across various scenarios, promising practical applicability. We initially train and test our model on the Turin dataset, followed by an assessment of its performance within the specific geographical context. Furthermore, we extend our investigation to the Southampton dataset without any adjustments, revealing disparities in performance. Additionally, we conduct comparative training on both datasets, offering insights into contextual factors influencing model efficacy within specific geographical areas. Our findings underscore the importance of contextual considerations for accurate air quality estimation and highlight the potential of our approach for real-world deployment. The datasets used in this study are publicly available, facilitating further research and validation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Airborne particulate matter (PM) refers to tiny particles in the air that can be composed of various materials; these particles vary in size and can have different chemical compositions, originating from both natural and human-made sources [1]. Airborne PM consists of a heterogeneous mixture of solid and liquid particles suspended in air that varies continuously in size and chemical composition in space and time. PM is categorized based on the diameter of the particles, measured in micrometres (μm) [2]. The main classifications include PM1, PM2.5, PM4, and PM10, representing different size fractions, each of them causing different problems regarding both environmental conditions, affecting ecosystems [3, 4], and human health [5], with complications that mainly impact the respiratory and cardiovascular systems. PM also has environmental consequences: when it settles on the soil, it can have a detrimental impact on the nutrient cycling of plants and disrupt the ecosystem's balance. This can potentially lead to negative consequences on the entire food chain and have long-lasting effects on the environment. When it comes to health concerns, much attention has been devoted to the finer fractions, such as PM2.5.
This work was presented at Ital-IA 2024, the 4th National Conference on Artificial Intelligence.</p>
      <p>The remainder of this paper is organized as follows: Section 2 introduces the datasets; Section 3 presents the methodology, describing the models used and the pipeline implemented; Section 4 presents the results and discussion; and Section 5 provides the conclusions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Datasets</title>
      <p>The datasets considered in this study were created from measurements captured in two different geographical areas, in both cases using SPS30 low-cost (LC) sensors as input and the co-located legal stations as reference:
• Turin (Italy): LC sensors capturing records with 15-minute frequency; reference station (RS) with hourly frequency based on Arpa weather stations [6];
• Southampton (UK): LC sensors capturing records with 2-minute frequency; RS sensors with hourly frequency based on Fidas200s weather stations [7].</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>The research consisted of a methodical process with distinct stages. Firstly, a brute-force testing procedure was carried out to determine the most appropriate machine-learning model from a variety of options. Subsequently, the pipeline was created by examining the ideal dataset split, feature selection, and transformation techniques required for the specific task. Lastly, a thorough evaluation of performance metrics was conducted using the Turin dataset, including MAE, MSE, MdAE, and R2.</p>
      <p>The data was obtained through individual sensor measurements, which were then used to construct the raw datasets for both Turin and Southampton. Subsequently, a thorough analysis of the LC and RS data was conducted to create a dataset linking each reference record with a low-cost measurement. To achieve this, the input datasets were resampled to match the hourly frequency of the reference datasets. Initially, the resampling technique employed was averaging all the LC data over the RS hourly record. However, due to significant variations in the data within an hour, it was decided to assign the closest available LC record to each RS record instead. After this process, the raw datasets for both Turin and Southampton were created, and preprocessing techniques [8] were applied to uniformly adjust the data, preparing them for the training step. In the performance evaluation, only the preprocessed dataset was considered for comparison.</p>
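      <p>The closest-record assignment described above can be sketched in a few lines of Python; the function and variable names are illustrative and not taken from the paper's codebase:</p>

```python
from datetime import datetime, timedelta

def align_nearest(lc_records, rs_times):
    """For each reference-station timestamp, pick the low-cost (LC)
    record whose timestamp is closest, instead of averaging over the hour."""
    aligned = {}
    for rs_t in rs_times:
        # min over LC records by absolute time distance to the RS timestamp
        nearest = min(lc_records, key=lambda rec: abs(rec[0] - rs_t))
        aligned[rs_t] = nearest[1]
    return aligned

# Toy example: LC data every 15 minutes, RS timestamps hourly
t0 = datetime(2023, 1, 1, 0, 0)
lc = [(t0 + timedelta(minutes=15 * i), float(i)) for i in range(8)]  # 00:00..01:45
rs = [t0 + timedelta(hours=h) for h in range(2)]                     # 00:00, 01:00
print(align_nearest(lc, rs))
```

      <p>A real implementation over large datasets would typically use a vectorized nearest-key join (e.g. pandas' merge_asof) rather than a linear scan per reference record.</p>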
      <p>Incorporating contextual features based on time into
the feature extraction process has allowed for a more
thorough understanding of the data. This approach not
only captures the original features but also encodes
information about the time axis, enabling a fine and
accurate representation of patterns that unfold over time.</p>
      <p>Ultimately, this results in more insightful and precise
outcomes.</p>
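      <p>As a sketch, the contextual time features mentioned here (and listed later among the dataset features: month, day of the week, hour) can be derived from each record's timestamp; the helper name is illustrative:</p>

```python
from datetime import datetime

def time_features(ts):
    """Encode contextual time features alongside the sensor features:
    month (1-12), day of the week (0 = Monday), and hour (0-23)."""
    return {"month": ts.month, "day_of_week": ts.weekday(), "hour": ts.hour}

# July 14, 2023 is a Friday, so day_of_week is 4
print(time_features(datetime(2023, 7, 14, 16, 30)))
```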
      <p>3.1. Model
The first step was to determine the appropriate model for the problem at hand. To accomplish this, a Bulk Regressor was implemented. This function tests a variety of regression models from popular Python libraries, such as scikit-learn, on the target dataset, ultimately producing a ranking of the most successful models based on average prediction accuracy metrics. Interestingly, the top-performing models were nonlinear, indicating that interpreting the features required an examination of nonlinear relationships between them. As a result, LightGBM was chosen as the model for this study.</p>
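      <p>The Bulk Regressor idea — fit many candidate regressors on the same data and rank them by an average error metric — can be illustrated with a dependency-free sketch; the two toy models and all names below are ours (the actual implementation drew its candidates from libraries such as scikit-learn):</p>

```python
class MeanRegressor:
    """Baseline: always predicts the training mean (a linear, constant model)."""
    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
    def predict(self, X):
        return [self.mean_ for _ in X]

class NearestRegressor:
    """1-nearest-neighbour on the first feature (a simple nonlinear model)."""
    def fit(self, X, y):
        self.data_ = list(zip(X, y))
    def predict(self, X):
        return [min(self.data_, key=lambda d: abs(d[0][0] - x[0]))[1] for x in X]

def bulk_rank(models, X_train, y_train, X_test, y_test):
    """Fit every candidate model and rank them by mean absolute error (ascending)."""
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        scores[name] = sum(abs(p - t) for p, t in zip(pred, y_test)) / len(y_test)
    return sorted(scores.items(), key=lambda kv: kv[1])

# Toy nonlinear target: y = x^2
X = [[float(i)] for i in range(10)]
y = [x[0] ** 2 for x in X]
ranking = bulk_rank({"mean": MeanRegressor(), "1nn": NearestRegressor()},
                    X[:8], y[:8], X[8:], y[8:])
print(ranking)
```

      <p>On this toy quadratic target, the nonlinear 1-NN baseline outranks the constant mean predictor, mirroring the observation that nonlinear models led the ranking.</p>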
      <p>LightGBM (Light Gradient Boosting Machine) is a powerful and efficient gradient-boosting framework developed by Microsoft researchers in 2017 [9]. It is designed to be efficient and scalable, making it particularly well-suited for large datasets and high-dimensional feature spaces.</p>
      <p>It utilizes the boosting framework, building an ensemble of weak learners (decision trees) sequentially to minimize the overall prediction error, thus ultimately combining multiple weak models to create a strong predictive model. Unlike depth-first tree growth in traditional gradient boosting frameworks like XGBoost [10], LightGBM adopts a leaf-wise tree growth strategy which chooses the leaf with the maximum delta loss to grow, which can lead to faster convergence and reduced computational cost. The trees are then used as usual, choosing the path that maximizes the information gain, which is evaluated via the variance score of each node. Other characteristics are that it includes a feature selection process by itself, and the loss usually used is the Mean Squared Error (MSE) loss, Eq. 1.</p>
      <p>The final set of features included in the datasets comprises "pm1", "pm2p5", "pm2p5 RF target", "pm4", "pm10", "wind speed", "pressure", "temperature", "relative humidity", "month", "day of the week", and "hour". The correlation matrix is depicted in Figure 1.</p>
      <p>$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ (1)</p>
      <p>3.2. Split Techniques
Different split configurations were tested in order to obtain the optimal one for this case study, starting from a simple random split and going towards more complex splits based on the time period considered. The different splits considered are:
• Random Total Split (RTS): Random split among all the records in the domain of the whole dataset;
• Random Day Split (RDS): Random split obtained by grouping all the records by day, then randomly splitting in the subdomain of the single day;
• Random Month Split (RMS): Random split obtained by grouping all the records by month, then randomly splitting in the subdomain of the single month;
• Forecast Day Split (FDS): Forecast split obtained by grouping all the records by day, then assigning the first 75% to the train and the last 25% to the test in the subdomain of the single day;
• Forecast Month Split (FMS): Forecast split obtained by grouping all the records by month, then assigning the first 75% to the train and the last 25% to the test in the subdomain of the single month.</p>
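      <p>The grouped splits above can be sketched as follows; `random_day_split` implements the RDS idea (group records by calendar day, then split 75/25 at random within each day). Names are illustrative, not from the paper's codebase:</p>

```python
import random
from datetime import datetime, timedelta
from collections import defaultdict

def random_day_split(records, train_ratio=0.75, seed=0):
    """RDS: group records by calendar day, then randomly assign
    75% of each day's records to the training set and 25% to the test set."""
    by_day = defaultdict(list)
    for rec in records:
        by_day[rec["time"].date()].append(rec)
    rng = random.Random(seed)
    train, test = [], []
    for day_records in by_day.values():
        rng.shuffle(day_records)
        cut = int(len(day_records) * train_ratio)
        train.extend(day_records[:cut])
        test.extend(day_records[cut:])
    return train, test

# Toy example: 2 days x 24 hourly records
records = [{"time": datetime(2023, 1, 1) + timedelta(hours=h), "pm2p5": float(h)}
           for h in range(48)]
train, test = random_day_split(records)
print(len(train), len(test))  # 36 12
```

      <p>The forecast variants (FDS/FMS) would replace the shuffle with a chronological sort, assigning the first 75% of each group to the training set.</p>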
      <p>Every split considered kept a 75-25 ratio between the
training and test set, simply varying the domain
considered and whether the records were picked randomly or
sequentially. Each of the aforementioned split techniques
was tested over the preprocessed Turin dataset to choose
the best-performing split for the next steps.</p>
      <p>As can be inferred from the results in Table 1, RTS achieves the best results across the board; however, since we are working with time series, this split should not be chosen, as it tends to overestimate performance due to the nature of the data. Therefore, the split technique adopted in the next steps of this research is the RDS.</p>
      <p>Table 1: Dataset split with performance metrics over the preprocessed Turin dataset.</p>
      <p>3.3. Pipeline
3.3.1. Feature Selection
The feature selection step measures the correlation between each feature and the target variable, using the correlation coefficient r to assess these correlations, as indicated by Equation 2.</p>
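      <p>Equation 2 itself did not survive extraction, but the |r| &gt; 0.1 threshold below points to a standard correlation coefficient; a stdlib sketch of that selection rule, assuming Pearson's r and with illustrative names:</p>

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation coefficient between a feature and the target."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def select_features(features, target, threshold=0.1):
    """Keep features whose absolute correlation with the target exceeds the
    threshold; a negative r is kept too, as it signals inverse proportionality."""
    return [name for name, values in features.items()
            if abs(pearson_r(values, target)) > threshold]

target = [1.0, 2.0, 3.0, 4.0, 5.0]
features = {
    "pm1":      [1.1, 2.0, 2.9, 4.2, 5.1],   # strongly correlated
    "pressure": [5.0, 4.0, 3.0, 2.0, 1.0],   # perfectly anti-correlated
    "noise":    [3.0, 1.0, 4.0, 1.0, 3.0],   # essentially uncorrelated
}
print(select_features(features, target))
```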
      <p>Consequently, even if a negative correlation with the
target variable is obtained using this formula, it remains
valuable as it signifies an inverse correlation, akin to
inverse proportionality. Ultimately, the features selected
by this method are those for which |r| &gt; 0.1.
3.3.2. Skewness Transformation
Skewness is a statistical measure that describes the
asymmetry of the probability distribution of a real-valued
random variable. In simpler terms, it measures the
degree and direction of skew (departure from horizontal
symmetry) in a dataset. A skewness value of 0 indicates
a perfectly symmetrical distribution, see Eq. 3. Positive
skewness indicates a longer right tail, while negative
skewness indicates a longer left tail.</p>
      <p>$\mathrm{Skewness} = \frac{n}{(n-1)(n-2)}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s}\right)^3$ (3)</p>
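      <p>Eq. 3 (the adjusted Fisher–Pearson sample skewness) and the skewness-reducing step can be sketched as follows; the log1p transform is one common choice, not necessarily the exact transformation set used in the paper:</p>

```python
from math import sqrt, log1p

def sample_skewness(xs):
    """Adjusted Fisher-Pearson skewness (Eq. 3):
    n / ((n-1)(n-2)) * sum(((x - mean) / s)^3), with s the sample std dev."""
    n = len(xs)
    mean = sum(xs) / n
    s = sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in xs)

def reduce_skew(xs, threshold=1.0):
    """Apply log1p to a non-negative feature when its skewness is high."""
    if abs(sample_skewness(xs)) > threshold:
        return [log1p(x) for x in xs]
    return xs

data = [1.0, 1.0, 2.0, 2.0, 3.0, 50.0]  # long right tail -> positive skew
print(sample_skewness(data) > 1.0)
print(sample_skewness(reduce_skew(data)) < sample_skewness(data))
```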
      <p>When dealing with regression problems, addressing highly skewed variables is crucial, as they can impact the model's fit. This is primarily due to the assumption of linearity made by most regression algorithms, which presupposes linear relationships between features. By applying transformations such as power or logarithmic functions, this effect can be mitigated, especially considering that the chosen model inherently possesses nonlinear properties. Additionally, highly skewed predictor variables can make the model overly sensitive to extremely high values, potentially resulting in a poor fit for the majority of the data. To tackle this issue, a skewness transformation was incorporated into the pipeline. This transformation applies a predefined set of transformations to each feature in order to reduce its skewness.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Results and Discussion</title>
      <p>By applying all the aforementioned techniques, the final pipeline is created and then trained on the preprocessed Turin dataset with the RDS split method. While the model reliably predicts where results should fall within their value range, it struggles to accurately forecast how they are distributed over time. Consequently, it can be inferred that the geographic location under study exerts a significant influence on PM forecasting.</p>
      <p>To tailor forecasting models to specific geographic
zones, it is essential to incorporate the studied area
as a feature or consider creating independent models
for each area under consideration. The challenge
faced by the model in this scenario may stem from
several factors, including the distinct nature of the
datasets, their unique contextual considerations, and the
temporal misalignment despite both datasets covering
an entire year. Furthermore, the placement of the SPS30 sensors within different devices for Southampton and Turin introduces significant variability in the collected data due to positional and rotational differences.</p>
      <p>To delve deeper into this issue, an additional test was performed by merging records from both the Southampton and Turin datasets. This merged dataset served as the comprehensive training and testing dataset with the RDS split and was subsequently processed through the aforementioned pipeline. The objective of this test was to develop a model capable of addressing both challenges simultaneously, by incorporating data from both geographical areas concurrently.</p>
      <p>As we can see from the results in Table 6, this test provided surprisingly good results all across the board, with great values both in the distance metrics and in R2. However, upon analyzing the Bland-Altman plot in Figure 4, it becomes apparent that there exist relatively high absolute differences between the predicted and actual values, particularly within the first range of values, where the majority of records are concentrated. This discrepancy implies that while the predictions generally fall within the desired range considering the wide scope of values (over 87k records), the model's precision in predicting exact values is suboptimal.</p>
      <p>One possible explanation for this phenomenon is the variability of PM values across different geographical areas, attributable to diverse environmental conditions. Without incorporating a feature that delineates between the two areas, the model treats the PM range as a unified domain for both datasets, endeavouring to predict within that domain without differentiation due to the absence of pertinent information. These findings underscore the original hypothesis, emphasizing the necessity to either incorporate features that encapsulate environmental conditions or devise distinct models for different areas, as the available features alone are insufficient to infer such information.</p>
      <p>To conclude this discussion and affirm the thesis, a final test was conducted by creating a new independent model using only the Southampton data. The results presented in Table 7 serve to reinforce the thesis that tailoring a model to a specific geographical area yields superior outcomes in accurately capturing and predicting PM levels using machine learning techniques. The model trained exclusively on Southampton data demonstrates excellent performance across all metrics utilized, consolidating the argument for geographic specialization in PM forecasting models.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In conclusion, this paper presents a comprehensive study
on the development of the LightGBM model for
predicting PM levels, highlighting the crucial role of
geographical considerations in the process. The study evaluates
various dataset split techniques and identifies the RDS
method as the most effective. The learning pipeline
encompasses feature selection and skewness
transformation. Remarkably, this pipeline achieves state-of-the-art
results on both the Turin and Southampton datasets
independently.</p>
      <p>Furthermore, a comparative analysis is conducted on
different combinations of data, as well as a merged dataset
test incorporating data from both regions simultaneously.</p>
      <p>However, the findings suggest that creating independent
models for distinct geographical areas yields the best
performance for this case study, underscoring the
significance of environmental conditions surrounding the
utilized sensor.</p>
      <p>This research endeavours to lay the groundwork for constructing models capable of generalizing while taking localized environmental factors into account in the predictive modelling of PM levels.</p>
      <p>Table 7: Metrics for the model trained exclusively on Southampton data — MAE 1.73, RMSE 3.04, MdAE 1.01, R2 0.88.</p>
      <p>References
[1] K. R. Daellenbach, G. Uzu, J. Jiang, L.-E. Cassagnes, Z. Leni, A. Vlachou, G. Stefenelli, F. Canonaco, S. Weber, A. Segers, J. J. P. Kuenen, M. Schaap, O. Favez, A. Albinet, S. Aksoyoglu, J. Dommen, U. Baltensperger, M. Geiser, I. El Haddad, J.-L. Jaffrezo, A. S. H. Prévôt, Sources of particulate-matter air pollution and its oxidative potential in Europe, Nature 587 (2020) 414–419. doi:10.1038/s41586-020-2902-8.
[2] A. Mukherjee, M. Agrawal, World air particulate matter: sources, distribution and health effects, Environmental Chemistry Letters 15 (2017) 283–309. doi:10.1007/s10311-017-0611-9.
[3] X. Yue, Y. Hu, C. Tian, R. Xu, W. Yu, Y. Guo, Increasing impacts of fire air pollution on public and ecosystem health, The Innovation 5 (2024) 100609.
[4] D. Grantz, J. Garner, D. Johnson, Ecological effects of particulate matter, Environment International 29 (2003) 213–239. doi:10.1016/S0160-4120(02)00181-2.
[5] M. J. Mohammadi, B. F. Dehaghi, S. Mansourimoghadam, A. Sharhani, P. Amini, S. Ghanbari, Cardiovascular disease, mortality and exposure to particulate matter (PM): a systematic review and meta-analysis, Reviews on Environmental Health 39 (2024) 141–149. doi:10.1515/reveh-2022-0090.
[6] M. Casari, L. Po, L. Zini, Low-cost PM data, 2023. URL: https://doi.org/10.5281/zenodo.10037781. doi:10.5281/zenodo.10037781.
[7] F. M. J. Bulot, Characterisation and calibration of low-cost PM sensors at high temporal resolution to reference grade performances dataset, 2022. URL: https://doi.org/10.5281/zenodo.7198378. doi:10.5281/zenodo.7198378.
[8] M. Casari, L. Po, MITH: A framework for mitigating hygroscopicity in low-cost PM sensors, Environmental Modelling &amp; Software 173 (2024) 105955. doi:10.1016/j.envsoft.2024.105955.
[9] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, T.-Y. Liu, LightGBM: A highly efficient gradient boosting decision tree, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 30, Curran Associates, Inc., 2017.
[10] T. Chen, C. Guestrin, XGBoost: A scalable tree boosting system, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, Association for Computing Machinery, New York, NY, USA, 2016, pp. 785–794. doi:10.1145/2939672.2939785.</p>
      <p>A. Online Resources
The Turin dataset used in this study is freely available through the Zenodo platform [6].</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>