<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Journal of Advanced Transportation</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1016/j.trc.2024.104663</article-id>
      <title-group>
        <article-title>A benchmark methodology for urban traffic pattern clustering using SUMO-based expert-verified ground truth</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vitaliy Pavlyshyn</string-name>
          <email>Vitaliy@ualeaders.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eduard Manziuk</string-name>
          <email>eduard.em.km@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adnène Arbi</string-name>
          <email>adnene.arbi@insat.ucar.tn</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nebojsa Bacanin</string-name>
          <email>nbacanin@singidunum.ac.rs</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Iurii Krak</string-name>
          <email>iurii.krak@knu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Glushkov Cybernetics Institute</institution>
          ,
          <addr-line>40, Glushkov Ave., Kyiv, 03187</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Khmelnytskyi National University</institution>
          ,
          <addr-line>11, Instytuts'ka str., 29016 Khmelnytskyi</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Singidunum University</institution>
          ,
          <addr-line>32 Danijelova St., 11000 Belgrade</addr-line>
          ,
          <country country="RS">Serbia</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Taras Shevchenko National University of Kyiv</institution>
          ,
          <addr-line>64/13, Volodymyrska str., Kyiv, 01601</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>University of Carthage</institution>
          ,
          <addr-line>Avenue de la République, 1054 Amilcar, Tunis</addr-line>
          ,
          <country country="TN">Tunisia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>3675</volume>
      <fpage>18</fpage>
      <lpage>27</lpage>
      <abstract>
        <p>Identifying urban traffic patterns is critical for reducing CO2 emissions, yet existing research lacks standardized benchmarks for objectively evaluating clustering algorithms. This fundamental gap prevents accurate assessment because real-world traffic data typically lacks ground truth labels, making the validation of clustering quality impossible. In this work, we propose a methodology for controlled comparison of clustering algorithms using expert-verified ground truth labels derived from SUMO simulations of real urban scenarios. We systematically evaluate six clustering algorithms (HDBSCAN, K-Means, MeanShift, AffinityPropagation, BayesianGMM, AgglomerativeClustering) on both aggregated and concatenated vector representations of traffic data. Our experiments reveal that HDBSCAN achieves the highest accuracy in recovering ground truth scenarios (ARI=0.73, V-measure=0.79) on aggregated data, outperforming K-Means by 0.03 in ARI. Furthermore, aggregated representations systematically outperformed detailed temporal data for all algorithms with an average ARI improvement of 0.15. The study provides a validated benchmarking methodology enabling objective algorithm selection for traffic management systems aimed at emission reduction.</p>
      </abstract>
      <kwd-group>
        <kwd>Clustering</kwd>
        <kwd>traffic patterns</kwd>
        <kwd>SUMO</kwd>
        <kwd>urban traffic</kwd>
        <kwd>traffic management</kwd>
        <kwd>CO2</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The growth of urbanization and urban traffic intensity creates serious challenges for sustainable city
development, especially in the context of combating climate change. The transport sector accounts
for over one-third of CO2 emissions from final energy consumption in cities, making traffic flow
optimization critically important for achieving climate goals [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Identifying characteristic urban traffic
patterns enables the development of traffic management strategies and reduction of environmental
impact [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>Modern urban transport systems face a critical problem: there are no reliable methods for objectively
assessing the quality of clustering algorithms when analyzing traffic flows. Most existing research is
based on real GPS data or traffic detector readings, which by their nature lack ground truth labels,
making accurate assessment of clustering quality impossible. This fundamental problem creates a
significant barrier to developing traffic management systems aimed at reducing CO2 emissions.</p>
      <p>The previously unsolved part of the general problem of traffic flow optimization lies in the absence
of standardized methodologies for controlled comparison of clustering algorithms under conditions
where the true structure of traffic patterns is known. This gap is especially critical in the context of
cities’ climate commitments, where accurate identification of traffic patterns can significantly impact
emission reduction.</p>
      <p>Manifestations of this problem include the inability to determine which clustering algorithm best
identifies real traffic patterns under different urban conditions, a lack of consensus on optimal metrics
for evaluating traffic data clustering quality, and a shortage of controlled experimental conditions for
validating research results in this field.</p>
      <p>The main contribution of this research is a proposed methodology for controlled comparison of
clustering algorithms for traffic data using expert-verified ground truth labels created from real urban
traffic scenarios, which allows objective evaluation of different clustering approaches under conditions
maximally approximating real transport systems. The research also contributes to understanding the
impact of different traffic data aggregation approaches on clustering quality, which has direct practical
significance for developing traffic management systems oriented toward reducing CO2 emissions
through traffic flow optimization.</p>
      <p>The structure of the paper is as follows. The “Related works” section analyzes existing approaches
to traffic data clustering. The “Materials and methods” section describes the experimental methodology
and algorithms. The “Results” section presents quantitative algorithm indicators. The “Discussion”
section interprets the obtained results and compares them with existing approaches.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <p>
        This section provides an overview of current research on traffic flow clustering, simulation approaches
using SUMO for CO2 emission assessment, and intelligent traffic management systems. Traffic data
clustering represents an actively developing research area evolving under the influence of growing
needs to reduce CO2 emissions from transport. The transport sector accounts for over one-third of
CO2 emissions from final consumption [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], making traffic flow optimization critically important for
achieving climate goals.
      </p>
      <p>
        Analysis of current research reveals two dominant approaches: centroid-based and density-based
clustering methods. Systematic analysis of K-means application for zone classification by congestion
level showed its quality in identifying delay patterns associated with different types of traffic flows [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
However, limitations of centroid methods stimulated development of hybrid approaches. Combining
pairwise comparison with density-based methods proved applicable for processing multidimensional
time series, demonstrating advantages over traditional centroid algorithms [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        Mathematical models of urban mobility optimization are developing in parallel [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], integrating with
graph-based approaches for analyzing spatiotemporal cluster evolution [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. These methodologies allow
not only identifying static patterns but also tracking their dynamics over time, which is critical for
traffic flow forecasting.
      </p>
      <p>
        Comparative algorithm analysis reveals HDBSCAN’s advantage due to its ability to automatically
determine the number of clusters and handle noise. The integration of visual analytics approaches with
machine learning algorithms [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] provides methodological foundation for combining expert knowledge
with automated pattern detection in complex datasets. Two-phase approaches integrating GIS and
HDBSCAN demonstrated advantages in spatial analysis of accident-prone areas [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], confirmed by
enhanced versions for multi-level spatial pattern analysis [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. A critical advantage of HDBSCAN is
detecting variable-density clusters, corresponding to real characteristics of urban traffic flows.
      </p>
      <p>
        Further methodology development led to creation of emission-sensitive clustering algorithms [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ],
which extend dynamic pattern detection capabilities. For high-dimensional data, stratified density
algorithms are proposed [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], solving the curse of dimensionality problem in big data.
      </p>
      <p>
        The transition from real data to controlled experiments determines the growing role of simulation
tools. SUMO became the standard thanks to integration capabilities with real sensor data [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Validation
studies on heterogeneous transport conditions confirmed SUMO’s universality through achieving high
accuracy correspondence between simulation and real data [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Research on generating and calibrating
microscopic urban models for different scenarios [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] provides methodological foundation for using
SUMO in transport system research.
      </p>
      <p>
        General environmental emission reduction trends drive integration of traffic pattern analysis with
emission assessment. Systematization of carbon emission reduction technologies [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and development
of multimodal approaches [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] demonstrate alignment of traffic optimization with climate goals. CO2
emission forecasting using deep learning and explainable artificial intelligence achieved quite high
accuracy [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], revealing that fuel consumption conditions in urban and suburban settings have greater
impact than vehicle engine characteristics.
      </p>
      <p>
        Predictive models for intersections using portable measurement systems and density clustering
algorithms [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] provide detailed micro-level analysis, complementing macroscopic approaches. Combining
big data and artificial intelligence opens new management possibilities. Adaptive traffic light control
can reduce travel time by 11% during peak hours, extrapolating to annual CO2 emission reduction of
31.73 million tons [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. Recent work on AI-driven traffic signal control systems [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] demonstrates the
direct applicability of machine learning approaches to emission reduction, reinforcing the practical
importance of accurate traffic pattern identification for environmental objectives.
      </p>
      <p>Modeling approaches for intelligent transport systems [20] emphasize the need for environmental
considerations in urban mobility optimization, aligning with the emission reduction focus of this
research. Multi-agent deep reinforcement learning approaches [21] and connected vehicle coordination
systems [22] demonstrate evolution from centralized to distributed control. Cooperative traffic light
control methods with deep learning [23] ensure coordination between multiple intersections, creating
an adaptive control network.</p>
      <p>Methodological analysis shows transition from one-dimensional to multi-level approaches.
Comprehensive reviews emphasize the importance of quality simulation data [24], implemented through
two-level machine learning architectures [25]. Integration of spatiotemporal data with real-time route
optimization [26] and multi-scale models for medium-term forecasting [27] demonstrate growing
complexity of predictive systems. Network approach to mobility analysis through cluster detection
methods [28] is complemented by high-resolution cellular network data analysis [29]. Systematization
of pattern identification methods using smart card data and deep learning application for spatiotemporal
analysis [30] demonstrate evolution from descriptive to predictive urban mobility modeling.</p>
      <p>Critical analysis of traditional methods reveals their limitations in determining optimal cluster
numbers. Metaheuristic approaches [31] and evolutionary K-means methods [32] offer solutions through
automatic parameter optimization, especially important for traffic data with unknown cluster structure.
Diversity-based approaches to clustering [33] demonstrate that ensemble methods leveraging multiple
clustering perspectives can improve robustness and accuracy, particularly relevant for heterogeneous
traffic pattern identification. Selection criteria for ensemble models [34] provide theoretical basis
for comparing multiple clustering algorithms systematically, which motivates the multi-algorithm
evaluation approach adopted in this study.</p>
      <p>Thus, the analysis revealed that despite significant progress in the considered field, critical gaps remain.
The absence of benchmark data complicates objective algorithm comparison. Fuzzy clustering application
emphasizes the need for controlled experimental conditions and ground truth labels for reliable result
validation. The conducted analysis reveals a fundamental contradiction: despite the diversity of available
clustering algorithms and their theoretical advantages, the absence of labeled benchmark datasets makes
objective quality comparison impossible for traffic data. This contradiction determines the research goal:
improving traffic pattern identification quality by developing an approach that ensures objective
clustering algorithm comparison on controlled simulation data with known ground truth structure.</p>
      <p>To achieve this goal, the following tasks are formulated:
1. Create a traffic flow simulation model verified by experts and based on real urban scenarios.
2. Conduct systematic comparison of six representative clustering algorithms using standardized
metrics.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Materials and methods</title>
      <sec id="sec-3-1">
        <title>3.1. General approach schema</title>
        <p>The general schema of the proposed approach is shown in Figure 1. The proposed schema implements
the research hypothesis that using a city map and empirical knowledge about existing traffic flow
behavior, SUMO can simulate traffic movement that corresponds to real traffic and allows objective
evaluation of clustering algorithm quality.</p>
        <p>The approach consists of five main stages: (1) creating a simulation model based on real urban
scenarios; (2) generating traffic data in time window format; (3) converting data into vector representation for
clustering; (4) applying clustering algorithms; (5) evaluating result quality using standardized metrics.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Traffic representation model</title>
        <p>Traffic flow is formalized as a sequence of traffic light intersection states at discrete time moments.
Each traffic light state at a given time moment is characterized by a vector of vehicle counts on each traffic lane.
Two types of vector data representation are used for analysis: concatenated and averaged values.</p>
        <sec id="sec-3-2-1">
          <title>3.2.1. Concatenated values</title>
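          <p>Concatenated values provide a detailed temporal representation of traffic flow that preserves
complete information about vehicle count changes over time. For each 30-minute time window <inline-formula><tex-math>w</tex-math></inline-formula>, a
vector <inline-formula><tex-math>\mathbf{x}_w^{\mathrm{concat}}</tex-math></inline-formula> of dimensionality <inline-formula><tex-math>d_{\mathrm{concat}} = 70 \times 180 = 12{,}600</tex-math></inline-formula> is formed, where 70 is the total number of
traffic lanes at all traffic lights, and 180 is the number of time slices (data is recorded every 10 seconds
during 30 minutes):
<disp-formula id="eq-1"><label>(1)</label><tex-math>\mathbf{x}_w^{\mathrm{concat}} = [c_{1,1,1}, c_{1,1,2}, \ldots, c_{1,1,7}, c_{1,2,1}, \ldots, c_{180,10,n_{10}}],</tex-math></disp-formula>
where <inline-formula><tex-math>c_{t,j,i}</tex-math></inline-formula> is the number of vehicles on the <inline-formula><tex-math>i</tex-math></inline-formula>-th lane of the <inline-formula><tex-math>j</tex-math></inline-formula>-th traffic light at time moment <inline-formula><tex-math>t</tex-math></inline-formula>,
<inline-formula><tex-math>t \in \{1, 2, \ldots, 180\}</tex-math></inline-formula>, <inline-formula><tex-math>j \in \{1, 2, \ldots, 10\}</tex-math></inline-formula>, <inline-formula><tex-math>i \in \{1, 2, \ldots, n_j\}</tex-math></inline-formula>, where <inline-formula><tex-math>n_j</tex-math></inline-formula> is the number of lanes at the
<inline-formula><tex-math>j</tex-math></inline-formula>-th traffic light. This approach preserves temporal traffic dynamics but creates a high-dimensional
feature space, which may lead to the curse of dimensionality problem in clustering.</p>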
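          <p>As an illustration only, the concatenated representation can be sketched in a few lines of Python. The nested-list layout <monospace>window[t][j][i]</monospace> and the function name <monospace>concatenate_window</monospace> are assumptions made for this sketch and are not taken from the authors' code; the toy window uses 3 time slices and 2 traffic lights rather than the 180 slices and 10 lights of the study.</p>
          <preformat preformat-type="code">
```python
def concatenate_window(window):
    """Flatten one window into a single vector.

    `window` is a list of time slices; each slice is a list of per-light
    lane counts, so window[t][j][i] is the number of vehicles on lane i
    of traffic light j at time slice t (an assumed layout).
    The output is ordered by time, then traffic light, then lane.
    """
    return [count
            for time_slice in window
            for light in time_slice
            for count in light]


# Toy example: 3 time slices, 2 traffic lights with 2 and 1 lanes.
toy_window = [
    [[4, 0], [2]],
    [[5, 1], [2]],
    [[3, 1], [0]],
]
toy_vector = concatenate_window(toy_window)
print(toy_vector)  # [4, 0, 2, 5, 1, 2, 3, 1, 0]

# Dimensionality stated in the text: 70 lanes times 180 time slices.
print(70 * 180)  # 12600
```
          </preformat>
          <p>The vector length grows linearly with the number of time slices, which is why the 30-minute window at 10-second resolution reaches 12,600 features.</p>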
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. Averaged values</title>
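          <p>Averaged values provide an aggregated temporal representation, where one summary value is computed
for each traffic lane over the entire window period. A vector <inline-formula><tex-math>\mathbf{x}_w^{\mathrm{avg}}</tex-math></inline-formula> of dimensionality <inline-formula><tex-math>d_{\mathrm{avg}} = 70</tex-math></inline-formula> is formed:
<disp-formula id="eq-2"><label>(2)</label><tex-math>\mathbf{x}_w^{\mathrm{avg}} = [\bar{c}_{1,1}, \bar{c}_{1,2}, \ldots, \bar{c}_{1,7}, \bar{c}_{2,1}, \ldots, \bar{c}_{10,n_{10}}],</tex-math></disp-formula>
where each component is computed as the arithmetic mean:
<disp-formula id="eq-3"><label>(3)</label><tex-math>\bar{c}_{j,i} = \frac{1}{T} \sum_{t=1}^{T} c_{t,j,i},</tex-math></disp-formula>
where <inline-formula><tex-math>T = 180</tex-math></inline-formula> is the number of time slices in the window and <inline-formula><tex-math>c_{t,j,i}</tex-math></inline-formula> is the number of vehicles on the <inline-formula><tex-math>i</tex-math></inline-formula>-th lane of the <inline-formula><tex-math>j</tex-math></inline-formula>-th traffic light at time slice <inline-formula><tex-math>t</tex-math></inline-formula>. Thus, <inline-formula><tex-math>\bar{c}_{j,i}</tex-math></inline-formula> is the average number of vehicles
on the <inline-formula><tex-math>i</tex-math></inline-formula>-th lane of the <inline-formula><tex-math>j</tex-math></inline-formula>-th traffic light over the entire window period.</p>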
          <p>This approach sacrifices temporal detail to reduce feature space dimensionality by a factor of 180,
providing better conditions for clustering algorithms and increasing resistance to short-term data
fluctuations. The trade-off between information completeness and clustering quality is investigated
experimentally by comparing results on both representation types.</p>
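          <p>A minimal sketch of the averaging step, under the same assumed nested-list layout <monospace>window[t][j][i]</monospace> (time slice, traffic light, lane). The function name <monospace>average_window</monospace> and the toy values are hypothetical and serve only to show how each lane collapses to one mean value per window.</p>
          <preformat preformat-type="code">
```python
def average_window(window):
    """Return the per-lane arithmetic mean over all time slices of a window."""
    n_slices = len(window)
    # Flatten each time slice to one list of lane counts (light-major order).
    flat_slices = [[count for light in time_slice for count in light]
                   for time_slice in window]
    n_lanes = len(flat_slices[0])
    return [sum(time_slice[k] for time_slice in flat_slices) / n_slices
            for k in range(n_lanes)]


# Toy example: 3 time slices, 2 traffic lights with 2 and 1 lanes.
toy_window = [
    [[4, 0], [2]],
    [[5, 1], [2]],
    [[3, 2], [2]],
]
toy_means = average_window(toy_window)
print(toy_means)  # [4.0, 1.0, 2.0]

# Reduction factor between the two representations described in the text.
print(12600 // 70)  # 180
```
          </preformat>
          <p>The output has one entry per lane regardless of window length, so short-lived spikes inside a window are smoothed out before clustering.</p>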
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Traffic pattern determination method</title>
        <p>Traffic pattern identification is implemented through a sequential process integrating expert knowledge
with controlled simulation to ensure objective clustering algorithm evaluation. The complete algorithmic
procedure is presented in Algorithm 1 and Figure 2.</p>
        <sec id="sec-3-3-1">
          <title>3.3.1. Creating base traffic scenarios</title>
          <p>A detailed schema of the traffic pattern determination method is shown in Figure 2. At the first stage, four
base traffic scenarios are formed based on surveillance camera data analysis and expert knowledge
from municipal traffic specialists. The morning scenario is characterized by intensive traffic to the
city center and market zone, reflecting typical commuting migrations on working days. The evening
scenario represents reverse flow from center and market to residential areas. The random scenario
models uniformly distributed traffic without clearly expressed dominant direction. The special scenario
reflects characteristic traffic from the peripheral Hrechany district, which differs from general city
patterns due to its location specifics and transport infrastructure. Each scenario is verified by experts to
ensure correspondence with real city traffic flows.</p>
        </sec>
        <sec id="sec-3-3-2">
          <title>3.3.2. Traffic flow simulation</title>
          <p>At the second stage, created scenarios are implemented in SUMO simulation environment version 1.15.0
using a real city map. The simulation covers an 11-hour period with recording states of 10 key traffic
light intersections. Data is collected at 10-second intervals, providing sufficient temporal resolution for
capturing traffic flow dynamics. In total, 4,080 traffic light state records are generated. The simulation
is configured considering real urban road network parameters.</p>
        </sec>
        <sec id="sec-3-3-3">
          <title>3.3.3. Time window formation</title>
          <p>At the third stage, generated data is segmented into 30-minute time windows to ensure sufficient
information volume for statistical pattern analysis. An overlapping window method with a 10-minute
shift step is used, which increases the number of observations from 22 to 66 and ensures temporal result
stability.</p>
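          <p>The window counts can be checked with a short sketch. It assumes that only complete windows are kept and that the series consists of the 4,080 records reported in Section 3.3.2; the constant and function names are hypothetical.</p>
          <preformat preformat-type="code">
```python
SLICE_SECONDS = 10                        # one record every 10 seconds
WINDOW_LEN = 30 * 60 // SLICE_SECONDS     # 180 records per 30-minute window
STEP = 10 * 60 // SLICE_SECONDS           # 60 records per 10-minute shift


def window_starts(n_records, window_len=WINDOW_LEN, step=STEP):
    """Start indices of all complete windows taken every `step` records."""
    return list(range(0, n_records - window_len + 1, step))


overlapping = window_starts(4080)                   # 10-minute shift
disjoint = window_starts(4080, step=WINDOW_LEN)     # no overlap
print(len(overlapping), len(disjoint))  # 66 22
```
          </preformat>
          <p>Under these assumptions the overlapping scheme yields 66 windows and the non-overlapping scheme 22, matching the counts stated above.</p>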
        </sec>
        <sec id="sec-3-3-4">
          <title>3.3.4. Traffic data vectorization</title>
          <p>At the fourth stage, each time window is converted into vector representation according to the
formalization described in Section 3.2. Obtained vectors are standardized using z-score normalization.</p>
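          <p>For reference, z-score normalization rescales every feature to zero mean and unit variance across the set of windows. The sketch below is a plain-Python illustration, not the authors' implementation; it assumes the population standard deviation and leaves constant features at zero, which is also how scikit-learn's <monospace>StandardScaler</monospace> behaves. The function name <monospace>zscore_columns</monospace> is hypothetical.</p>
          <preformat preformat-type="code">
```python
from statistics import fmean, pstdev


def zscore_columns(rows):
    """Standardize each feature (column) of a list of equal-length vectors."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # A constant feature has zero deviation; dividing by 1.0 keeps it at 0.
    scales = [pstdev(col) or 1.0 for col in columns]
    return [[(value - mean) / scale
             for value, mean, scale in zip(row, means, scales)]
            for row in rows]


# Toy example: three windows, two features (the second one is constant).
toy = [[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]]
scaled = zscore_columns(toy)
print([round(row[0], 3) for row in scaled])  # [-1.225, 0.0, 1.225]
```
          </preformat>
          <p>Standardizing per feature keeps lanes with heavy traffic from dominating distance computations over lanes with light traffic.</p>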
        </sec>
        <sec id="sec-3-3-5">
          <title>3.3.5. Applying clustering algorithms</title>
          <p>At the fifth stage, six representative clustering algorithms are applied to the vectorized data, with
parameters optimized through preliminary validation on a pilot dataset. AffinityPropagation
is used with a damping parameter equal to 0.8. MeanShift is applied with automatic bandwidth
determination. BayesianGMM is configured with n_components = 20 and full covariance type.
AgglomerativeClustering uses a distance threshold of 0.15. HDBSCAN is applied with the cosine metric and
min_cluster_size = 4. K-Means is tested with 5 and 7 clusters.</p>
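          <p>The settings listed above could be expressed with scikit-learn roughly as follows. This is a sketch under assumptions, not the authors' configuration file: the two parameter names that are unreadable in the source are assumed to be <monospace>n_components</monospace> and <monospace>min_cluster_size</monospace>, the linkage of AgglomerativeClustering is left at the library default because the text does not state it, and the names <monospace>ALGORITHM_SETTINGS</monospace> and <monospace>build_estimators</monospace> are hypothetical.</p>
          <preformat preformat-type="code">
```python
# Keyword arguments per algorithm, following scikit-learn parameter names.
# Assumption: n_components and min_cluster_size stand in for the two
# parameter names that did not survive in the text.
ALGORITHM_SETTINGS = {
    "AffinityPropagation": {"damping": 0.8},
    "MeanShift": {},  # bandwidth=None lets scikit-learn estimate it
    "BayesianGMM": {"n_components": 20, "covariance_type": "full"},
    # distance_threshold requires n_clusters=None in scikit-learn.
    "AgglomerativeClustering": {"n_clusters": None, "distance_threshold": 0.15},
    "HDBSCAN": {"metric": "cosine", "min_cluster_size": 4},
    "KMeans-5": {"n_clusters": 5},
    "KMeans-7": {"n_clusters": 7},
}


def build_estimators(settings=ALGORITHM_SETTINGS):
    """Instantiate the estimators (needs scikit-learn 1.3+ for HDBSCAN)."""
    from sklearn.cluster import (AffinityPropagation, AgglomerativeClustering,
                                 HDBSCAN, KMeans, MeanShift)
    from sklearn.mixture import BayesianGaussianMixture

    classes = {
        "AffinityPropagation": AffinityPropagation,
        "MeanShift": MeanShift,
        "BayesianGMM": BayesianGaussianMixture,
        "AgglomerativeClustering": AgglomerativeClustering,
        "HDBSCAN": HDBSCAN,
        "KMeans": KMeans,
    }
    # "KMeans-5" and "KMeans-7" both map to the KMeans class.
    return {name: classes[name.split("-")[0]](**kwargs)
            for name, kwargs in settings.items()}
```
          </preformat>
          <p>Each estimator returned by <monospace>build_estimators()</monospace> exposes <monospace>fit_predict</monospace>, so the same loop can produce labels for both the averaged and the concatenated vectors.</p>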
        </sec>
        <sec id="sec-3-3-6">
          <title>3.3.6. Creating ground truth labels</title>
          <p>Ground truth labels are formed based on the simulation time schedule, where each of the 66 time windows
receives the label of the corresponding traffic scenario according to its activity period. Scenario time
boundaries are determined considering window overlap and the need to ensure a sufficient observation
count for each pattern type.</p>
          <p>[Algorithm 1. Traffic Pattern Determination and Evaluation Method. Annotations: simulation length of 11 hours × 3,600 sec/hour; set of time windows; ground truth labels for windows; step of 10 minutes.]</p>
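          <p>One way to derive window labels from a schedule is sketched below. The scenario boundaries in <monospace>SCHEDULE</monospace> are invented for illustration because the source does not list them, and labelling a window by the scenario active at its midpoint is an assumed rule rather than the one used in the study.</p>
          <preformat preformat-type="code">
```python
from bisect import bisect_right

# Hypothetical schedule: (start minute, scenario name), sorted by start.
# These boundaries are NOT the ones used in the study.
SCHEDULE = [(0, "morning"), (180, "random"), (360, "special"), (480, "evening")]


def label_window(start_min, window_min=30, schedule=SCHEDULE):
    """Label a window by the scenario active at its midpoint (assumed rule)."""
    midpoint = start_min + window_min / 2
    starts = [start for start, _ in schedule]
    return schedule[bisect_right(starts, midpoint) - 1][1]


# 66 window start times, one every 10 minutes.
window_start_minutes = list(range(0, 651, 10))
labels = [label_window(start) for start in window_start_minutes]
print(len(labels), labels[0], labels[-1])  # 66 morning evening
```
          </preformat>
          <p>Windows that straddle a boundary are the ambiguous cases; the midpoint rule resolves them deterministically, but a majority-overlap rule would be an equally plausible choice.</p>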
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Evaluation metrics</title>
        <p>Clustering quality was evaluated by two metric categories. Internal metrics (Silhouette Score,
Davies-Bouldin Index, Calinski-Harabasz Index) characterize geometric properties of formed clusters without
using external information. External metrics (V-measure, Adjusted Rand Index (ARI), Normalized
Mutual Information (NMI), Fowlkes-Mallows Score) compare clustering results with ground truth labels
created by experts.</p>
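        <p>To make the main external metric concrete, the following sketch computes the Adjusted Rand Index from its standard pair-counting definition using only the Python standard library. It is an illustration rather than the evaluation code of the study, which reports using scikit-learn; the function name <monospace>adjusted_rand_index</monospace> is hypothetical.</p>
        <preformat preformat-type="code">
```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index computed from the contingency table of two labelings."""
    n = len(labels_true)
    if n in (0, 1):
        return 1.0  # no pairs to disagree on
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)   # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # both labelings are trivial (one cluster or all singletons)
    return (sum_cells - expected) / (max_index - expected)


# Cluster names do not matter, only the grouping does.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))            # 1.0
print(round(adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 2]), 2))  # 0.57
```
        </preformat>
        <p>Because the index subtracts the agreement expected by chance, a value near 0 corresponds to random assignment and 1 to a perfect match; it should not be read as a percentage of correctly labelled windows.</p>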
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Experimental setup</title>
        <p>Experiments were conducted on simulation data generated in SUMO 1.15.0 using a real city map.
Clustering was performed using scikit-learn 1.3.0 in Python 3.9 environment. Ground truth labels
were created based on simulation time schedule, where each of 66 windows received the label of
corresponding scenario according to activity period.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <sec id="sec-4-1">
        <title>4.1. Experimental results presentation</title>
        <sec id="sec-4-1-1">
          <title>4.1.1. Internal clustering quality metrics</title>
        </sec>
        <sec id="sec-4-1-2">
          <title>4.1.2. External clustering quality metrics</title>
          <p>Table 2 shows comparison results with ground truth labels of expert-verified scenarios, allowing
evaluation of accuracy in recovering true traffic pattern structure. Figure 3 presents comparison of
Silhouette Score and Adjusted Rand Index indicators for all algorithms. The graph demonstrates the
advantage of averaged data (circles) over concatenated (triangles), and positioning of HDBSCAN and
K-Means in the upper right part indicates their optimal balance between geometric cluster quality and
accuracy of ground truth scenario recovery.</p>
          <p>Figure 4 presents a heatmap of normalized values of five quality metrics for averaged data. HDBSCAN
demonstrates the most balanced high indicators across all external metrics, while BayesianGMM shows
critically low values across practically all criteria.</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Results analysis</title>
        <sec id="sec-4-2-1">
          <title>4.2.1. Algorithm comparison by internal metrics</title>
          <p>Internal metrics analysis revealed a clear pattern: all algorithms demonstrate better results on aggregated
(averaged) data compared to detailed (concatenated) values. K-Means with 7 clusters showed the
highest Silhouette Score (0.56) and Calinski-Harabasz Index (279.58) for averaged values, indicating best
geometric cluster quality. MeanShift demonstrated the lowest Davies-Bouldin Index (0.64) on averaged
data, indicating optimal ratio of intra-cluster compactness and inter-cluster separation. Critical quality
deterioration is observed for concatenated data: average Silhouette Score decrease is 0.25 points, and
Calinski-Harabasz Index decreases on average by 8.7 times. AgglomerativeClustering showed the most
dramatic quality drop on concatenated data.</p>
        </sec>
        <sec id="sec-4-2-2">
          <title>4.2.2. Ground truth scenario recovery accuracy</title>
          <p>External metrics demonstrate HDBSCAN’s advantage for accurate recovery of expert-verified traffic
scenarios. HDBSCAN achieved the highest ARI (0.73) and V-measure (0.79) on averaged data, indicating
strong chance-adjusted agreement with ground truth labels and a balance between completeness and cluster homogeneity.
K-Means (5 clusters) ranked second in accuracy (ARI = 0.70). MeanShift, despite high internal
metrics, showed somewhat lower ground truth scenario recovery accuracy (ARI = 0.63). The worst
results were demonstrated by AgglomerativeClustering on concatenated data (ARI = 0.03), practically
corresponding to random point distribution across clusters.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>Results demonstrate clear advantage of aggregated (averaged) data over detailed (concatenated) values
for all studied algorithms. HDBSCAN showed highest results by external metrics (ARI = 0.73, V-measure
= 0.79), confirming its quality for traffic pattern identification. K-Means with 7 clusters achieved the
highest Silhouette Score (0.56) but showed lower results in ground truth scenario recovery accuracy.</p>
      <p>Significant quality deterioration on concatenated data (for example, HDBSCAN ARI decreases from
0.73 to 0.61) indicates that high dimensionality and temporal detail complicate stable pattern detection.
AgglomerativeClustering showed critically low ARI (0.03) on concatenated data due to creating excessive
numbers of small clusters.</p>
      <p>Unlike existing research focusing on GPS trajectory analysis, our approach uses aggregated data
from traffic light intersections, which better corresponds to practical urban traffic management needs.
Results align with previous work conclusions regarding HDBSCAN advantages for traffic data but are the first
to demonstrate a quantitative comparison on a controlled dataset.</p>
      <p>Main limitations include: (1) using simulation data that may not fully reflect real traffic complexity;
(2) a limited number of scenarios (4 types) that may not cover all urban traffic pattern diversity; (3) focus
on one city, limiting result generalizability; (4) no consideration of external factors (weather, events,
accidents).</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>The research demonstrated HDBSCAN’s capability for traffic pattern identification based on
expert-verified simulation data, achieving the highest ground truth scenario recovery accuracy (ARI = 0.73,
V-measure = 0.79) on aggregated data. Key numerical results show a significant advantage of using
averaged values over detailed time series, with an average ARI improvement of 0.15 for all algorithms.
HDBSCAN outperformed the baseline K-Means by 0.03 in ARI and 0.06 in V-measure, proving the
effectiveness of density-based clustering in this domain. A critical finding is the susceptibility of
concatenated, high-dimensional data to the curse of dimensionality, which drastically reduced the
performance of algorithms like AgglomerativeClustering. The main limitation of this study lies in
using simulation data from a single city with a limited set of four scenarios, which may restrict the
generalizability of results to other urbanized territories with different topological complexity. To
address this, future research expansion is planned through integrating real camera data to validate
simulation findings, testing the methodology on multiple cities to ensure robustness, and developing
hybrid approaches to improve mixed traffic scenario identification. These steps will further refine the
benchmark methodology, enabling more effective traffic management systems capable of significant
CO2 emission reductions.</p>
    </sec>
    <sec id="sec-7">
      <title>Funding</title>
      <p>This research was funded by the European Union’s Horizon Europe Framework Programme under
grant agreement No. 101148374, project “U_CAN: Ukraine towards Carbon Neutrality.” The views and
opinions expressed are the authors’ own and do not necessarily reflect those of the European Union or
the funding agency, the European Climate, Infrastructure and Environment Executive Agency.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>The authors would like to express their gratitude to the European Union’s Horizon Europe Framework
Programme for the financial support that made this research possible. We also extend our sincere
appreciation to the developers and open-source communities behind the essential software tools used
in this study, including SUMO, scikit-learn, pandas, and NumPy, whose contributions were invaluable
to our work.</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <collab>International Energy Agency</collab>
          ,
          <source>Transport - energy system</source>
          ,
          <year>2024</year>
          . URL: https://www.iea.org/energy-system/transport.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N.</given-names>
            <surname>Rouky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bousouf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Benmoussa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fri</surname>
          </string-name>
          ,
          <article-title>A spatiotemporal analysis of traffic congestion patterns using clustering algorithms: A case study of casablanca</article-title>
          ,
          <source>Decision Analytics Journal</source>
          <volume>10</volume>
          (
          <year>2024</year>
          )
          <elocation-id>100404</elocation-id>
          . doi:
          <pub-id pub-id-type="doi">10.1016/j.dajour.2024.100404</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>I. T.</given-names>
            <surname>Sarteshnizi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sarvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Bagloee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Nassir</surname>
          </string-name>
          ,
          <article-title>Temporal pattern mining of urban traffic volume data: a pairwise hybrid clustering method</article-title>
          ,
          <source>Transportmetrica B: Transport Dynamics</source>
          (
          <year>2023</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1080/21680566.2023.2185496</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Ulvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Yerlikaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Yildiz</surname>
          </string-name>
          ,
          <article-title>Urban traffic mobility optimization model: A novel mathematical approach for predictive urban traffic analysis</article-title>
          ,
          <source>Applied Sciences</source>
          <volume>14</volume>
          (
          <year>2024</year>
          )
          <elocation-id>5873</elocation-id>
          . doi:
          <pub-id pub-id-type="doi">10.3390/app14135873</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>I.</given-names>
            <surname>Portugal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Alencar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cowan</surname>
          </string-name>
          ,
          <article-title>A framework for spatial-temporal cluster evolution representation and analysis based on graphs</article-title>
          ,
          <source>Scientific Reports</source>
          <volume>14</volume>
          (
          <year>2024</year>
          )
          <elocation-id>5873</elocation-id>
          . doi:
          <pub-id pub-id-type="doi">10.1038/s41598-024-72504-x</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>I.</given-names>
            <surname>Krak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Barmak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Manziuk</surname>
          </string-name>
          ,
          <article-title>Using visual analytics to develop human and machine-centric models: A review of approaches and proposed information technology</article-title>
          ,
          <source>Computational Intelligence</source>
          <volume>38</volume>
          (
          <year>2022</year>
          )
          <fpage>921</fpage>
          -
          <lpage>946</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1111/coin.12289</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <article-title>A two-phase clustering approach for traffic accident black spots identification: integrated gis-based processing and hdbscan model</article-title>
          ,
          <source>International Journal of Injury Control and Safety Promotion</source>
          (
          <year>2023</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1080/17457300.2022.2164309</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>A detection of multi-level co-location patterns based on column calculation and hdbscan clustering</article-title>
          ,
          <source>Intelligent Data Analysis</source>
          (
          <year>2025</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1177/1088467X241308765</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Bot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Peeters</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liesenborgs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Aerts</surname>
          </string-name>
          ,
          <article-title>Flasc: a flare-sensitive clustering algorithm</article-title>
          ,
          <source>PeerJ Computer Science</source>
          <volume>11</volume>
          (
          <year>2025</year>
          )
          <elocation-id>e2792</elocation-id>
          . doi:
          <pub-id pub-id-type="doi">10.7717/peerj-cs.2792</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>G.</given-names>
            <surname>Monko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kimura</surname>
          </string-name>
          ,
          <article-title>Enhanced stratified sampling-density-based spatial clustering of applications with noise (ss-dbscan) for high-dimensional data</article-title>
          ,
          <source>Data Science</source>
          <volume>8</volume>
          (
          <year>2025</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1177/24518492251349080</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Gonçalves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. O.</given-names>
            <surname>Silva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Santos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M. A. C.</given-names>
            <surname>Rocha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Peixoto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Durães</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Machado</surname>
          </string-name>
          ,
          <article-title>Urban traffic simulation using mobility patterns synthesized from real sensors</article-title>
          ,
          <source>Electronics</source>
          <volume>12</volume>
          (
          <year>2023</year>
          )
          <elocation-id>4971</elocation-id>
          . doi:
          <pub-id pub-id-type="doi">10.3390/electronics12244971</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>C.</given-names>
            <surname>Stang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bogenberger</surname>
          </string-name>
          ,
          <article-title>Calibration of microscopic traffic simulation in an urban environment using gps-data</article-title>
          ,
          <source>in: SUMO Conference Proceedings</source>
          , volume
          <volume>5</volume>
          ,
          <year>2024</year>
          , pp.
          <fpage>71</fpage>
          -
          <lpage>78</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.52825/scp.v5i.1099</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Keler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kunz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Amini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bogenberger</surname>
          </string-name>
          ,
          <article-title>Calibration of a microscopic traffic simulation in an urban scenario using loop detector data: A case study within the digital twin munich</article-title>
          ,
          <source>in: SUMO Conference Proceedings</source>
          , volume
          <volume>4</volume>
          ,
          <year>2023</year>
          , p.
          <fpage>153</fpage>
          . doi:
          <pub-id pub-id-type="doi">10.52825/scp.v4i.223</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Transportation carbon reduction technologies: A review of fundamentals, application, and performance</article-title>
          ,
          <source>Journal of Traffic and Transportation Engineering (English Edition)</source>
          <volume>11</volume>
          (
          <year>2024</year>
          )
          <fpage>1340</fpage>
          -
          <lpage>1377</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1016/j.jtte.2024.11.001</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>I.</given-names>
            <surname>Derpich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Duran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Carrasco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Moreno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fernandez-Campusano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Espinosa-Leal</surname>
          </string-name>
          ,
          <article-title>Pursuing optimization using multimodal transportation system: A strategic approach to minimizing costs and co2 emissions</article-title>
          ,
          <source>Journal of Marine Science and Engineering</source>
          <volume>12</volume>
          (
          <year>2024</year>
          ). doi:
          <pub-id pub-id-type="doi">10.3390/jmse12060976</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>G. M. I.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Tanim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Sarker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Watanobe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Islam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Mridha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Nur</surname>
          </string-name>
          ,
          <article-title>Deep learning model based prediction of vehicle co2 emissions with explainable ai integration for sustainable environment</article-title>
          ,
          <source>Scientific Reports</source>
          <volume>15</volume>
          (
          <year>2025</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1038/s41598-025-87233-y</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mądziel</surname>
          </string-name>
          ,
          <article-title>Predictive methods for co2 emissions and energy use in vehicles at intersections</article-title>
          ,
          <source>Scientific Reports</source>
          <volume>15</volume>
          (
          <year>2025</year>
          )
          <elocation-id>6463</elocation-id>
          . doi:
          <pub-id pub-id-type="doi">10.1038/s41598-025-91300-9</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>K.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <article-title>Big-data empowered traffic signal control could reduce urban carbon emission</article-title>
          ,
          <source>Nature Communications</source>
          <volume>16</volume>
          (
          <year>2025</year>
          )
          <elocation-id>2013</elocation-id>
          . doi:
          <pub-id pub-id-type="doi">10.1038/s41467-025-56701-4</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>O.</given-names>
            <surname>Ryzhanskyi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Pavlyshyn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Radiuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Manziuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Barmak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Krak</surname>
          </string-name>
          ,
          <article-title>Ai-driven traffic signal control system to reduce co2 emissions</article-title>
          ,
          <source>in: CEUR Workshop Proceedings</source>
          , volume
          <volume>3974</volume>
          ,
          <year>2025</year>
          , pp.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>