<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Regression analysis as a tool for identifying patterns in atmospheric air monitoring data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Dmytro V. Shevchenko</string-name>
          <email>dimashevchenko10021999@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bella L. Holub</string-name>
          <email>bellalg@nubip.edu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>CEUR Workshop Proceedings, ISSN 1613-0073, ceur-ws.org</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National University of Life and Environmental Sciences of Ukraine</institution>
          ,
          <addr-line>Kyiv, Heroyiv Oborony st., 15, 03041</addr-line>
        </aff>
      </contrib-group>
      <fpage>20</fpage>
      <lpage>27</lpage>
      <abstract>
        <p>This study investigates the application of regression analysis as a tool for identifying patterns in atmospheric air quality monitoring data collected from IoT-based monitoring systems. Given the increasing importance of air pollution control and environmental safety, precise analytical methods are required to assess the relationships between pollutant concentrations and external factors such as meteorological conditions. The research employs the ordinary least squares (OLS) regression method to analyze the influence of key atmospheric parameters (temperature, humidity, wind speed, and radiation) on air quality indices, specifically the Air Quality Index (AQI) and the Common Air Quality Index (CAQI). The study is based on data obtained from a network of IoT-enabled sensors deployed across various monitoring stations, which continuously measure air pollutants and meteorological parameters in real time. A custom-built analytical module was developed to facilitate flexible data selection, enabling comparative assessments across different time frames, monitoring stations, and measurement parameters. Through statistical modeling, it was determined that radiation significantly influences both AQI and CAQI, while wind speed has a more pronounced effect during daytime hours. Furthermore, the study revealed that hourly aggregation of pollutant data is optimal for CAQI calculations, whereas daily averages better align with AQI assessments. The findings highlight the advantages of integrating IoT technology with regression-based analysis in modern air quality monitoring systems. By leveraging Python-based statistical tools such as StatsModels, this approach enables the identification of critical environmental factors affecting air pollution levels, thereby supporting more effective predictive modeling and decision-making in environmental policy and urban planning. Future research will explore the integration of nonlinear regression models and time series forecasting to further refine air quality assessments.</p>
      </abstract>
      <kwd-group>
        <kwd>IoT devices</kwd>
        <kwd>edge computing</kwd>
        <kwd>atmospheric air</kwd>
        <kwd>air quality monitoring</kwd>
        <kwd>pollutants</kwd>
        <kwd>data analysis</kwd>
        <kwd>statistical modeling</kwd>
        <kwd>Python</kwd>
        <kwd>StatsModels</kwd>
        <kwd>linear regression</kwd>
        <kwd>ordinary least squares (OLS)</kwd>
        <kwd>Common Air Quality Index (CAQI)</kwd>
        <kwd>Air Quality Index (AQI)</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>1.1. Relevance</title>
        <p>Atmospheric air is a critically important component of the ecosystem, providing oxygen to all living
organisms and sustaining essential metabolic processes. Air pollution has a significant negative impact
on human health, causing respiratory and cardiovascular diseases, weakening immune responses,
and contributing to the development of chronic illnesses. At the same time, air pollution adversely
affects animals and plants, compromising their health and vitality, which in turn negatively impacts
ecosystems.</p>
        <p>The state of atmospheric air quality is a determining factor for environmental safety and public
health. In Ukraine, air quality monitoring involves the systematic collection, analysis, and processing of
data on pollutant concentrations, serving as a foundation for developing state policies in environmental
protection. Traditionally, monitoring has relied on centralized stations with limited spatial coverage.
However, the advancement of IoT-based monitoring systems has enabled the deployment of low-cost,
real-time air quality sensors, significantly enhancing data granularity and accessibility.</p>
        <p>Modern air quality monitoring systems utilize IoT-enabled devices (or edge devices) to measure
pollutant levels and meteorological parameters continuously. These IoT sensors form a distributed
network, transmitting data to centralized platforms for processing and analysis. This approach allows for
real-time air quality assessment, early warning systems, and better urban environmental management.</p>
        <p>Given these challenges, developing effective strategies for air quality monitoring and
analysis is particularly important. Such strategies enable timely responses to threats, forecasting
of pollution impacts, and measures to mitigate adverse effects. The integration
of IoT technology with statistical modeling, such as regression analysis, offers new opportunities for
identifying hidden dependencies and trends in air pollution data.</p>
      </sec>
      <sec id="sec-1-2">
        <title>1.2. Related work</title>
        <p>
          In recent studies, significant attention has been given to integrating IoT technologies and machine
learning methods for monitoring and predicting air quality. Martín-Baos et al. [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] developed
a system that uses IoT sensors to collect pollution and traffic data in urban environments. By
employing linear regression (LR), Gaussian process regression (GPR), and random forests (RF), the
authors constructed models to assess the Air Quality Index (AQI) based on a limited set of parameters,
allowing real-time insights into urban air quality and traffic conditions.
        </p>
        <p>
          Similarly, Banciu et al. [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] proposed an air quality monitoring system that utilizes IoT devices to
collect data on temperature, humidity, and particulate matter (PM10, PM2.5). The collected data is
transmitted to the ThingSpeak cloud platform for storage and preliminary analysis. A regression model
based on TensorFlow is applied for AQI prediction, providing timely alerts and recommendations for
preventive measures.
        </p>
        <p>
          Bobulski et al. [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] introduced an IoT-based pollution monitoring system with secure data transmission.
This system enables localized real-time air quality monitoring and sends data to users. A key feature of
the system is implementing a secure data transmission protocol, ensuring protection against
cyberattacks and data interception.
        </p>
        <p>
          Ravindra et al. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] focused on improving the accuracy of air quality sensors using machine learning
methods. The study demonstrated that applying machine learning models for PM2.5 sensor calibration
significantly enhances the accuracy of air quality monitoring systems.
        </p>
        <p>
          Additionally, Dhanalakshmi and Radha [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] proposed an air pollution forecasting method based on
discretized linear regression and multi-class support vector machines. The proposed IoT system enables
monitoring and controlling air quality within a cloud computing environment, ensuring high prediction
accuracy and reduced data processing time.
        </p>
        <p>These studies highlight the effectiveness of combining IoT technologies and regression analysis for
air quality monitoring and prediction. The integration of modern machine learning methods and secure
data transmission improves the accuracy and reliability of monitoring systems, which is crucial for
making informed decisions in environmental protection and public health. While global studies focus
on improving monitoring accuracy and data transmission security, national-level reports highlight the
limitations of existing monitoring infrastructure, such as in Kyiv, Ukraine.</p>
        <p>
          The 2023 report by the Kyiv City State Administration [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] presents the results of atmospheric air
quality monitoring based on data collected from 46 indicative and 7 reference automatic stations. Despite
a high level of the Common Air Quality Index (CAQI), numerous exceedances of the maximum allowable
concentrations (MAC) for nitrogen oxides were recorded, as well as significant exceedances of daily
average MAC levels for ground-level ozone at nearly all reference stations. A comparison of these
findings with the results from the Borys Sreznevskyi Central Geophysical Observatory revealed the
technical obsolescence of the national air monitoring system, which fails to provide comprehensive and
up-to-date information about air quality. This situation highlights the need to modernize the monitoring
system and ensure regular access to data for informed decision-making aimed at improving air quality
in Kyiv.
        </p>
      </sec>
      <sec id="sec-1-3">
        <title>1.3. Objective of the study</title>
        <p>The objective of the study is to utilize regression analysis to identify patterns in atmospheric air quality
monitoring data collected from IoT-based environmental monitoring systems. The research aims to
develop a methodology for analyzing dependencies that will support effective decision-making to
improve the state of atmospheric air.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. System architecture</title>
      <p>The air quality monitoring system is based on IoT-enabled devices that continuously collect real-time
data on key air parameters, including pollutant concentrations, temperature, humidity, and other
environmental indicators.</p>
      <p>The spatial distribution of these IoT sensors plays a crucial role in data reliability. Urban regions with
high traffic density require a denser network of sensors compared to suburban or rural areas, where air
circulation patterns differ significantly. In Kyiv, for example, the placement of sensors near industrial
zones and major highways ensures a more accurate representation of pollution sources.</p>
      <p>These IoT devices serve as edge computing units, transmitting collected data to cloud-based or
on-premise servers for further analysis. The architecture of the monitoring system is illustrated in
figure 1.</p>
      <p>The collected data is integrated with other sources, such as municipal monitoring platforms (e.g., Kyiv
City State Administration), community initiatives (SaveEcoBot), and MQTT servers. This multi-source
system enables a comprehensive view of air quality and ensures data relevance. The data is transmitted
to a centralized database built on PostgreSQL, providing reliable storage and accessibility for further
processing.</p>
      <p>This integration allows for cross-validation of sensor readings with officially recognized monitoring
stations, reducing the impact of sensor biases and ensuring higher data reliability. By leveraging
municipal platforms, real-time data can be incorporated into existing environmental management
systems, facilitating timely responses to pollution spikes.</p>
      <p>An essential component of such systems is the organized storage and processing of datasets. In our
architecture, these tasks are managed using the PostgreSQL database management system (DBMS).
This open-source relational DBMS is well-suited for efficiently managing large datasets and executing
advanced queries. The database’s logical structure is depicted in figure 2.</p>
      <p>A vital element of the system is a set of tables designed to store data related to monitoring stations
and their recorded measurements. The station table holds detailed information about each monitoring
station, including unique identifiers, geographical coordinates, type of data source (e.g., stationary or
mobile), and operational status.</p>
      <p>Based on the accumulated data, two approaches are implemented for assessing air quality: the Air
Quality Index (AQI, USA) and the Common Air Quality Index (CAQI, Europe). Both indices serve as key
indicators of air pollution levels and reflect potential health risks, though there are notable differences
between them:</p>
      <list list-type="bullet">
        <list-item>
          <p>AQI (Air Quality Index): The AQI is calculated from the concentrations of primary air
pollutants, such as particulate matter (PM2.5, PM10), nitrogen dioxide (NO2), ozone (O3),
sulfur dioxide (SO2), and carbon monoxide (CO). It is widely used to provide a clear assessment
of health risks depending on pollution levels.</p>
        </list-item>
        <list-item>
          <p>CAQI (Common Air Quality Index): The CAQI is a standard adopted in European cities, also
based on pollutant concentrations, including PM2.5, PM10, NO2, O3, and SO2. This index allows
for a unified evaluation of air quality, which is crucial for regional comparisons.</p>
        </list-item>
      </list>
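      <p>To our understanding, both indices are commonly derived by mapping each pollutant concentration to a sub-index through linear interpolation within a concentration band and then reporting the highest sub-index as the overall value. The following minimal sketch illustrates that idea only. The names sub_index, overall_index, and EXAMPLE_PM25_BANDS are hypothetical, and the band values are placeholders rather than an official AQI or CAQI table.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import Dict, List, Tuple

# (c_low, c_high, i_low, i_high): a concentration band mapped to an index band.
Breakpoint = Tuple[float, float, float, float]

# Placeholder bands for illustration only; NOT an official AQI or CAQI table.
EXAMPLE_PM25_BANDS: List[Breakpoint] = [
    (0.0, 12.0, 0.0, 50.0),
    (12.0, 35.0, 50.0, 100.0),
    (35.0, 55.0, 100.0, 150.0),
]


def sub_index(concentration: float, bands: List[Breakpoint]) -> float:
    """Linearly interpolate a pollutant sub-index within its concentration band."""
    for c_low, c_high, i_low, i_high in bands:
        if c_low <= concentration <= c_high:
            return i_low + (i_high - i_low) * (concentration - c_low) / (c_high - c_low)
    raise ValueError(f"concentration {concentration} is outside the supplied bands")


def overall_index(sub_indices: Dict[str, float]) -> Tuple[str, float]:
    """Return the dominant pollutant and its sub-index, used as the overall index."""
    dominant = max(sub_indices, key=sub_indices.get)
    return dominant, sub_indices[dominant]


if __name__ == "__main__":
    pm25 = sub_index(23.5, EXAMPLE_PM25_BANDS)
    # The NO2 sub-index of 40.0 is an invented value for the demonstration.
    print(overall_index({"PM2.5": pm25, "NO2": 40.0}))
```
]]></preformat>
      <p>In practice, the bands would be taken from the official tables of the index being computed, with one table per pollutant and averaging period.</p>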
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <p>
        As part of the air quality monitoring system, a module was developed to enable flexible configuration
of parameters for data analysis using the OLS regression model, implemented via the StatsModels
library [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], a powerful tool for statistical modeling in Python. This module facilitates the analysis of
dependencies across selected stations, measurement parameters, and various time intervals. The tool
allows for personalized queries through filtering options, as illustrated in figure 3.
      </p>
      <p>The user can specify any time range for analysis, including both daily and hourly intervals. This
flexibility enables targeted investigations, such as focusing on peak activity periods or unusual
meteorological conditions. Additionally, the system allows for selecting either a single station or multiple
stations, which is particularly useful for comparative analysis across different areas or locations.</p>
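      <p>The selection logic can be sketched as a simple filter over readings. This is a hedged illustration rather than the module's implementation: the Reading record, the select_readings function, and the station identifiers are hypothetical, and a production system would more likely push these conditions into the database query.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class Reading:
    station_id: str
    timestamp: datetime
    parameter: str
    value: float


def select_readings(
    readings: Iterable[Reading],
    start: datetime,
    end: datetime,
    stations: Optional[Set[str]] = None,
    hours: Optional[range] = None,
) -> List[Reading]:
    """Keep readings in [start, end), optionally limited by station and hour of day."""
    selected = []
    for r in readings:
        if not (start <= r.timestamp < end):
            continue
        if stations is not None and r.station_id not in stations:
            continue
        if hours is not None and r.timestamp.hour not in hours:
            continue
        selected.append(r)
    return selected


# Invented sample values, used only to exercise the filter.
sample = [
    Reading("st-1", datetime(2024, 5, 1, 3), "PM2.5", 11.0),
    Reading("st-1", datetime(2024, 5, 1, 14), "PM2.5", 19.0),
    Reading("st-2", datetime(2024, 5, 1, 15), "PM2.5", 23.0),
    Reading("st-2", datetime(2024, 5, 2, 9), "PM2.5", 17.0),
]
daytime_st1 = select_readings(
    sample,
    datetime(2024, 5, 1),
    datetime(2024, 5, 2),
    stations={"st-1"},
    hours=range(8, 20),
)
print(daytime_st1)
```
]]></preformat>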
      <p>The regression analysis results yielded several key components that characterize the model’s quality
and the relationships between variables, as shown in figure 4. The primary model metrics include the
proportion of variance in the dependent variable explained by the independent variables (R-squared).
The F-statistic was also calculated to evaluate the overall significance of the model, along with its
p-value, confirming statistical relevance.</p>
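      <p>For readers less familiar with these metrics, the following sketch computes R-squared and the overall F-statistic from their textbook definitions for a model with an intercept. The function names are hypothetical and the four data points are invented; a statistics library would normally report both values directly.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from statistics import fmean
from typing import Sequence


def r_squared(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Share of variance explained: 1 - SS_res / SS_tot."""
    mean_actual = fmean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def f_statistic(r2: float, n_obs: int, n_predictors: int) -> float:
    """Overall F-statistic of a model with an intercept and n_predictors regressors."""
    return (r2 / n_predictors) / ((1.0 - r2) / (n_obs - n_predictors - 1))


# Invented values: four observations and the predictions of a one-regressor model.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
r2 = r_squared(actual, predicted)
print(f"R-squared: {r2:.2f}")
print(f"F-statistic: {f_statistic(r2, n_obs=4, n_predictors=1):.1f}")
```
]]></preformat>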
      <p>Furthermore, residual analysis provides deeper insight into the model’s fit to the data. A
coefficient table lists, for each independent variable, the estimated regression coefficient with its
standard error, t-value, and p-value. The importance of each variable is assessed from its statistical
significance, which helps identify the most impactful factors.</p>
      <p>The visualization of results includes a prediction plot that compares the actual values of the
dependent variable with those predicted by the model. The plot gives a visual assessment of the
model’s fit and predictive performance: the closer the predicted values lie to the actual observations,
the better the regression captures the key patterns and dependencies within the data.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion</title>
      <p>The analysis of results obtained through the developed module revealed important patterns in the
formation of air quality indices (AQI and CAQI) and the nuances of their application. The radiation
factor emerged as a significant variable, substantially influencing the calculation of both indices.</p>
      <p>When comparing the approaches to index calculation, it was found that hourly averages are more
effective for CAQI, while daily averages are more appropriate for AQI (figure 5). This difference reflects
the distinct algorithms used for each index and their impact on the final results. Such findings allow for
more precise adaptation of analysis to specific tasks and regional characteristics.</p>
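      <p>The two aggregation strategies compared here can be sketched as follows. This is an assumption-laden illustration: the aggregate function and the sample values are hypothetical, and timestamps are treated as naive local times.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict
from datetime import datetime
from statistics import fmean
from typing import Dict, Iterable, Tuple

Sample = Tuple[datetime, float]


def aggregate(samples: Iterable[Sample], period: str) -> Dict[datetime, float]:
    """Average values per hour ("hour") or per calendar day ("day")."""
    if period not in ("hour", "day"):
        raise ValueError("period must be 'hour' or 'day'")
    buckets = defaultdict(list)
    for ts, value in samples:
        # Truncate the timestamp to the start of its hour, then of its day if asked.
        key = ts.replace(minute=0, second=0, microsecond=0)
        if period == "day":
            key = key.replace(hour=0)
        buckets[key].append(value)
    return {key: fmean(values) for key, values in sorted(buckets.items())}


# Invented pollutant readings, used only to show the two groupings.
samples = [
    (datetime(2024, 5, 1, 8, 10), 10.0),
    (datetime(2024, 5, 1, 8, 40), 20.0),
    (datetime(2024, 5, 1, 9, 5), 30.0),
    (datetime(2024, 5, 2, 8, 0), 40.0),
]
hourly = aggregate(samples, "hour")
daily = aggregate(samples, "day")
print(hourly)
print(daily)
```
]]></preformat>
      <p>The hourly series would then feed a CAQI-style calculation and the daily series an AQI-style calculation, in line with the finding reported above.</p>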
      <p>An intriguing observation concerns the influence of wind on CAQI. During nighttime, wind has
minimal impact on the air quality index, while during daytime or over an entire day, its influence
becomes noticeable. This suggests a reduction in wind intensity at night, diminishing its capacity to
disperse pollutants and, consequently, affecting the modeling outcomes.</p>
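      <p>One simple way to probe such a day and night contrast is to fit the wind term separately on each subset and compare the slopes. The sketch below does this with a one-variable least-squares fit, which is a simplification of the multi-variable OLS model used in the study. The DAY_HOURS boundary, the function names, and the observations are all hypothetical, and the data are constructed so that wind matters only during the day.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from statistics import fmean
from typing import List, Sequence, Tuple

# (hour of day, wind speed in m/s, CAQI); synthetic values for illustration only.
Observation = Tuple[int, float, float]

DAY_HOURS = range(6, 22)  # hypothetical day/night boundary, not from the study


def ols_slope(x: Sequence[float], y: Sequence[float]) -> float:
    """Slope of a one-variable least-squares fit: S_xy / S_xx."""
    mean_x, mean_y = fmean(x), fmean(y)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    return s_xy / s_xx


def wind_slope(observations: List[Observation], daytime: bool) -> float:
    """Fit CAQI on wind speed using only daytime (or only nighttime) observations."""
    subset = [(w, c) for h, w, c in observations if (h in DAY_HOURS) == daytime]
    return ols_slope([w for w, _ in subset], [c for _, c in subset])


observations = [
    (1, 1.0, 40.0), (2, 2.0, 41.0), (3, 3.0, 40.0), (4, 4.0, 41.0), (5, 5.0, 40.0),
    (10, 1.0, 57.0), (11, 2.0, 54.0), (12, 3.0, 51.0), (13, 4.0, 48.0), (14, 5.0, 45.0),
]
night = wind_slope(observations, daytime=False)
day = wind_slope(observations, daytime=True)
print(f"night slope: {night:.2f}, day slope: {day:.2f}")
```
]]></preformat>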
      <p>These findings highlight the importance of tailoring air quality analyses to account for specific
temporal and environmental factors, ensuring more accurate and actionable insights. Despite the
advantages of using IoT-based monitoring systems and regression analysis, several limitations exist. IoT
sensors are prone to measurement drift over time, requiring regular calibration to maintain accuracy.
Additionally, regression models assume linear relationships between variables, which may not fully
capture complex atmospheric interactions. Incorporating nonlinear models or hybrid approaches with
machine learning could help address these challenges.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>This study confirms the effectiveness of applying regression analysis to identify patterns in atmospheric
air quality monitoring data, particularly in combination with IoT-based monitoring systems. The
integration of real-time sensor networks allows for more accurate and dynamic assessment of pollutant
levels and meteorological factors, improving the overall reliability of air quality analysis. Unlike
traditional methods that rely on static datasets from reference stations, this research introduces a
flexible approach based on Ordinary Least Squares (OLS) regression, which enables the identification of
statistically significant dependencies. One of the key findings is the impact of radiation levels on air
quality indices, a factor often overlooked in previous studies. The results also highlight fundamental
differences in data aggregation strategies, demonstrating that hourly averaging is more suitable for
CAQI, whereas daily averaging better aligns with AQI calculations. These insights allow for a more
precise selection of analytical techniques depending on the index used, enhancing predictive accuracy
in air quality assessments.</p>
      <p>A significant contribution of this work lies in the development of a custom analytical module
that facilitates flexible configuration of regression parameters, allowing users to explore pollutant
dependencies across different periods, stations, and environmental conditions. This tool supports
decision-making processes in environmental management by enabling targeted analysis of pollution
dynamics. The use of Python-based statistical libraries, particularly StatsModels, proves to be a
cost-effective alternative to more complex machine learning models while maintaining high interpretability.</p>
      <p>The study not only advances the methodological approach to air quality assessment but also sets the
foundation for future research. Expanding the analytical framework with nonlinear regression models
and time series forecasting could improve predictive capabilities, while further integration of data
mining techniques may reveal hidden dependencies within air pollution trends. The findings underscore
the importance of real-time monitoring systems and data-driven approaches for environmental
policymaking, providing a technological basis for more adaptive and responsive air quality management
strategies.</p>
      <p>The developed analytical framework can be utilized by policymakers to design data-driven air
pollution control strategies. By identifying critical pollution sources and peak contamination periods,
urban planners can implement targeted interventions such as emission restrictions in high-risk areas,
traffic rerouting, or increased green zones to mitigate pollution levels.</p>
      <p>Author Contributions: Conceptualization, Dmytro V. Shevchenko and Bella L. Holub; methodology, Dmytro V. Shevchenko;
software, Dmytro V. Shevchenko; writing – original draft, Dmytro V. Shevchenko; writing – review and editing, Bella L. Holub.
All authors have read and agreed to the published version of the manuscript.</p>
      <p>Funding: This research received no external funding.</p>
      <p>Data Availability Statement: No new data were created or analysed during this study. Data sharing is not applicable.</p>
      <p>Conflicts of Interest: The authors declare no conflict of interest.</p>
      <p>Declaration on Generative AI: During the preparation of this work, the authors used GPT-4o to draft
content, generate the literature review, paraphrase and reword text, improve writing style, draft the abstract,
check grammar and spelling, and enhance content. After using this service, the authors reviewed and edited the
content as needed and take full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J. Ángel</given-names>
            <surname>Martín-Baos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Rodriguez-Benitez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>García-Ródenas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>IoT based monitoring of air quality and traffic using regression analysis</article-title>
          ,
          <source>Applied Soft Computing</source>
          <volume>115</volume>
          (
          <year>2022</year>
          )
          <elocation-id>108282</elocation-id>
          . doi:
          <pub-id pub-id-type="doi">10.1016/j.asoc.2021.108282</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Banciu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Florea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bogdan</surname>
          </string-name>
          ,
          <article-title>Monitoring and Predicting Air Quality with IoT Devices</article-title>
          ,
          <source>Processes</source>
          <volume>12</volume>
          (
          <year>2024</year>
          )
          <elocation-id>1961</elocation-id>
          . doi:
          <pub-id pub-id-type="doi">10.3390/pr12091961</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bobulski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Szymoniak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Pasternak</surname>
          </string-name>
          ,
          <article-title>An IoT System for Air Pollution Monitoring with Safe Data Transmission</article-title>
          ,
          <source>Sensors</source>
          <volume>24</volume>
          (
          <year>2024</year>
          )
          <elocation-id>445</elocation-id>
          . doi:
          <pub-id pub-id-type="doi">10.3390/s24020445</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>K.</given-names>
            <surname>Ravindra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mor</surname>
          </string-name>
          ,
          <article-title>Enhancing accuracy of air quality sensors with machine learning to augment large-scale monitoring networks</article-title>
          ,
          <source>npj Climate and Atmospheric Science</source>
          <volume>7</volume>
          (
          <year>2024</year>
          )
          <elocation-id>326</elocation-id>
          . doi:
          <pub-id pub-id-type="doi">10.1038/s41612-024-00833-9</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Dhanalakshmi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Radha</surname>
          </string-name>
          ,
          <article-title>Discretized Linear Regression and Multiclass Support Vector Based Air Pollution Forecasting Technique</article-title>
          ,
          <source>International Journal of Engineering Trends and Technology</source>
          <volume>70</volume>
          (
          <year>2022</year>
          )
          <fpage>315</fpage>
          -
          <lpage>323</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.14445/22315381/ijett-v70i11p234</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <collab>Kyiv City State Administration</collab>
          ,
          <source>Discussion of the Kyiv City State Administration Report on Atmospheric Air Quality Monitoring Results in Kyiv for 2023</source>
          ,
          <year>2023</year>
          . URL: https://nubip.edu.ua/node/141976.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <collab>StatsModels Developers</collab>
          ,
          <source>StatsModels Ordinary Least Squares Documentation</source>
          ,
          <year>2023</year>
          . URL: https://www.statsmodels.org/dev/examples/notebooks/generated/ols.html.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>