<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Improvement in Road Crack Detection Based on Multiple Attention Mechanisms</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Junqing Wang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Qi Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lin Meng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>College of Science and Engineering, Ritsumeikan University</institution>
          ,
          <addr-line>1-1-1 Noji-higashi, Kusatsu, Shiga, 525-8577</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Graduate School of Science and Engineering, Ritsumeikan University</institution>
          ,
          <addr-line>1-1-1 Noji-higashi, Kusatsu, Shiga, 525-8577</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <fpage>83</fpage>
      <lpage>96</lpage>
      <abstract>
        <p>With the rapid expansion of global road networks, pavement maintenance is increasingly challenged by fine and irregular cracks caused by traffic loads and environmental conditions. Traditional inspection methods, including manual patrols and classical image processing, are often inefficient and lack sensitivity to subtle crack patterns. To address these limitations, we propose a novel road crack recognition framework that integrates object detection with semantic segmentation. The detection module enhances YOLOv11 by incorporating a Diverse Branch Block and a Triplet Attention Module to improve multiscale feature extraction with low computational cost. The segmentation module extends TransUNet by replacing the standard Transformer Encoder with a BiFormer Block and embedding a parallel Swin Transformer Block, enabling effective global-local context fusion. Experimental results demonstrate that the improved detection model achieves 83.9% mAP@50 and 65.8% mAP@[50:95] on the Ultralytics CrackSeg dataset. Meanwhile, the enhanced segmentation model attains 78.46% mean Intersection-over-Union on the CRACK500 dataset. These findings confirm the effectiveness of the proposed multi-attention architecture for accurate and scalable road crack analysis.</p>
      </abstract>
      <kwd-group>
        <kwd>Road Crack Detection</kwd>
        <kwd>YOLOv11</kwd>
        <kwd>TransUNet</kwd>
        <kwd>Deep learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The rapid expansion and aging of road infrastructure have intensified the demand for accurate
and scalable pavement crack detection methods to ensure timely maintenance and structural
safety. However, cracks are often fine, irregular, and low-contrast, making them difficult to
detect with traditional manual or rule-based methods.</p>
      <p>
        Deep learning-based object detection models [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ], such as YOLOv8 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and hybrid
CNN–Transformer frameworks [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], have been applied to road crack detection with encouraging
results. However, these models frequently exhibit limited precision in detecting small,
fragmented, or low-contrast cracks, particularly under complex surface textures or non-uniform
lighting conditions. On the other hand, semantic segmentation models like DeepLabv3+ with
attention [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and domain-adaptive approaches [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] can achieve pixel-level delineation but still
face challenges in segmenting thin cracks accurately and consistently, especially in real-world
conditions.
      </p>
      <p>To address these precision limitations from both tasks, we propose two improved models
for road crack recognition: an enhanced object detector based on YOLOv11 incorporating a
Diverse Branch Block and Triplet Attention Module, and a segmentation network that upgrades
TransUNet by replacing its Transformer Encoder with BiFormer Block and adding a parallel
Swin Transformer Block.</p>
      <p>Overall, our main contributions can be summarized as follows:</p>
      <list list-type="bullet">
        <list-item>
          <p>We embed the Diverse Branch Block into the YOLOv11 backbone and insert the Triplet
Attention Module in the neck to enrich multi-path features and joint spatial–channel
attention.</p>
        </list-item>
        <list-item>
          <p>We replace the Transformer layer of the TransUNet Encoder with a BiFormer Block and add a
parallel Swin Transformer Block path to create a dual encoder that captures
complementary global-local context.</p>
        </list-item>
      </list>
      <p>The remainder of this article is organized as follows. Section 2 reviews previous work on road
crack detection and the related YOLO and semantic segmentation models. Section 3 describes
the proposed methods. Section 4 presents the datasets and the experiments. Finally, Section 5
concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Object detection and semantic segmentation are two mainstream vision approaches used in
automated pavement crack analysis. Object detection locates crack regions with bounding
boxes, offering real-time inference and localization capabilities. Semantic segmentation, in
contrast, provides pixel-level classification, enabling precise extraction of crack shapes and
boundaries. Owing to their respective strengths (efficiency in detection and accuracy in structural
delineation), both methods have been widely applied in pavement inspection scenarios. They
help overcome challenges posed by the fine, irregular, and low-contrast nature of cracks,
especially under varying illumination and surface textures.</p>
      <p>
        YOLO models are widely applied in crack detection due to their real-time speed and
localization capabilities [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ]. Xia et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] enhanced YOLOv8 with attention mechanisms to improve
detection of multi-scale bridge cracks, though performance on narrow cracks remained limited.
Yu et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] proposed a coordinate-attention YOLOv8 variant for concrete cracks, achieving
better localization but showing reduced robustness under noisy textures. Ren and Zhong [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]
integrated feature fusion and attention into YOLO for building crack detection, improving recall
at the expense of increased computation. These approaches demonstrate progress, yet detecting
fine or fragmented cracks remains challenging.
      </p>
      <p>
        Semantic segmentation enables pixel-level crack extraction and is effective for detecting
fine or irregular patterns. Yoon et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] applied an attention-augmented UNet++ to port
pavement cracks, achieving good accuracy but limited generalization. Tan et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] introduced
ETAFHrNet, a transformer-based model for asymmetric cracks, showing strong performance
but high computational demand. Zhang et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] proposed DLANet for sealed crack
segmentation, which achieved fine boundary delineation but required extensive annotations. Despite
their precision, segmentation models often face trade-offs between accuracy, efficiency, and
data dependence. To address the distinct limitations of existing detection and segmentation
models, namely insufficient sensitivity to fine-grained features in detection and poor efficiency
or generalizability in segmentation, we propose two tailored network improvements, one for
each task.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>In this study, we propose two improved models for road crack recognition: an object detection
network based on YOLOv11 and a segmentation network built upon TransUNet. In the detection
branch, the C3k2 modules of the backbone are replaced with Diverse Branch Blocks to enhance
multi-scale feature extraction, while a Triplet Attention Module is integrated into the neck to
strengthen spatial and channel-wise attention. In the segmentation branch, the Transformer
encoder in TransUNet is replaced with a BiFormer Block for improved global context modeling,
and a parallel Swin Transformer Block is introduced to capture hierarchical local features.</p>
      <sec id="sec-3-1">
        <title>3.1. YOLOv11 Object Detection Model</title>
        <p>
          YOLOv11 is a recent advancement in the YOLO series that aims to improve detection
performance through architectural refinement and enhanced feature representation. It maintains
the fast, single-stage design characteristic of previous YOLO models while introducing better
optimization for small objects, high-density scenes, and complex backgrounds. These
improvements position YOLOv11 as an effective baseline for real-time object detection tasks such as
road crack detection [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. Its overall architecture is shown in Figure 1:
        </p>
        <p>YOLOv11 introduces architectural improvements to enhance detection accuracy and maintain
real-time performance. Its backbone employs the SPPF and C2PSA modules to expand receptive
fields and strengthen spatial attention. The neck adopts a feature pyramid structure with
C3k2 and upsampling layers for effective multi-scale feature fusion, while the head generates
predictions at different scales using lightweight CBS Blocks. These enhancements improve
robustness to small and dense objects, making YOLOv11 well-suited for fine-grained crack
detection tasks.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Improved YOLOv11 Model</title>
        <p>As shown in Figure 2, we propose an improved YOLOv11 model based on Diverse Branch Block
and Triplet Attention.</p>
        <p>
          YOLOv11 demonstrates strong real-time performance and competitive detection accuracy,
benefiting from its efficient single-stage design and optimized feature processing pipeline.
However, it still encounters challenges in detecting fine-grained or low-contrast targets, particularly
in cluttered or textured scenes, where the standard backbone may lack sufficient feature diversity
and spatial focus. Inspired by Fan et al. [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], we adopt the Diverse Branch Block to enhance
multi-scale feature representation through diverse receptive fields. Additionally, following the
approach of Li et al. [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], we integrate the Triplet Attention Module to reinforce joint spatial and
channel attention. These improvements are embedded into the YOLOv11 backbone and neck,
respectively, leading to better localization of subtle targets while maintaining the lightweight
structure suitable for real-time detection.
        </p>
        <sec id="sec-3-2-1">
          <title>3.2.1. Diverse Branch Block</title>
          <p>Diverse Branch Block is a reparameterizable convolution module designed to enrich the
representational capacity of convolutional layers by introducing multi-branch structures during
training. As illustrated in Figure 3, Diverse Branch Block consists of four parallel branches:
a standard k×k convolution, a 1×1 convolution followed by k×k, a 1×1 convolution with
average pooling, and a standalone average pooling path. All branches are followed by batch
normalization and aggregated before a nonlinear activation.</p>
          <p>
            During inference, these branches are mathematically merged into a single equivalent
convolution kernel, allowing the Diverse Branch Block to maintain inference efficiency. This design
enables richer gradient flow and feature diversity during training without incurring additional
runtime cost. Inspired by Li et al. [
            <xref ref-type="bibr" rid="ref19">19</xref>
            ], we replace the C3k2 modules in the YOLOv11 backbone with
Diverse Branch Block to improve multi-scale feature encoding and enhance detection sensitivity
for fine cracks and irregular textures.
          </p>
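          <p>The branch merging described above rests on the linearity of convolution: parallel branches that are summed can be replaced by one kernel that is the sum of their (zero-padded) kernels. The following is a minimal sketch of that idea only, written in plain Python for a single channel. The function names, the toy image, and the kernel values are illustrative assumptions, the average pooling branch is written as its equivalent constant kernel (assuming padded zeros are counted in the average), and batch normalization folding and the sequential 1×1 followed by k×k merge are omitted.</p>
          <preformat><![CDATA[
```python
def conv2d_same(image, kernel):
    """Single-channel 2D cross-correlation with zero padding (odd square kernel)."""
    k = len(kernel)
    pad = k // 2
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for dy in range(k):
                for dx in range(k):
                    yy, xx = y + dy - pad, x + dx - pad
                    if 0 <= yy < h and 0 <= xx < w:
                        out[y][x] += image[yy][xx] * kernel[dy][dx]
    return out


def pad_kernel(kernel, size):
    """Place a small odd kernel at the centre of a size x size zero kernel."""
    k = len(kernel)
    off = (size - k) // 2
    padded = [[0.0] * size for _ in range(size)]
    for i in range(k):
        for j in range(k):
            padded[off + i][off + j] = kernel[i][j]
    return padded


def merge_branches(kernels):
    """Sum parallel branch kernels into one equivalent kernel."""
    size = max(len(k) for k in kernels)
    merged = [[0.0] * size for _ in range(size)]
    for kernel in kernels:
        padded = pad_kernel(kernel, size)
        for i in range(size):
            for j in range(size):
                merged[i][j] += padded[i][j]
    return merged


def add_maps(maps):
    """Element-wise sum of equally sized feature maps."""
    h, w = len(maps[0]), len(maps[0][0])
    return [[sum(m[y][x] for m in maps) for x in range(w)] for y in range(h)]


# Toy input and hypothetical branch kernels (values chosen only for illustration).
image = [[1.0, 2.0, 0.0, 1.0],
         [0.0, 3.0, 1.0, 2.0],
         [4.0, 1.0, 0.0, 0.0],
         [2.0, 0.0, 1.0, 3.0]]
k3 = [[0.5, -1.0, 0.25], [0.0, 2.0, -0.5], [1.0, 0.0, -0.25]]
k1 = [[1.5]]
k_avg = [[1.0 / 9.0] * 3 for _ in range(3)]  # 3x3 average pooling as a kernel

branches = [k3, k1, k_avg]
train_time = add_maps([conv2d_same(image, k) for k in branches])  # multi-branch
inference = conv2d_same(image, merge_branches(branches))          # single kernel
max_diff = max(abs(a - b)
               for row_a, row_b in zip(train_time, inference)
               for a, b in zip(row_a, row_b))
print(max_diff)
```
]]></preformat>
          <p>The multi-branch output and the single-kernel output agree up to floating-point rounding, which is why the extra branches add no cost at inference time.</p>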
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. Triplet Attention Module</title>
          <p>
            Triplet Attention Module (TAT) is an attention mechanism designed to capture spatial and
channel-wise dependencies more effectively by employing a triplet-branch structure. As
illustrated in Figure 4, it comprises three parallel branches, each computing attention along different
axis pairs: (height × channel), (width × channel), and (height × width). These branches apply
convolutional transformations followed by sigmoid activation and inter-branch aggregation
to capture richer feature interactions. According to Misra et al. [
            <xref ref-type="bibr" rid="ref20">20</xref>
            ], this structure improves
the network’s ability to focus on salient regions while preserving spatial information across all
directions.
          </p>
          <p>In our model, we insert TAT into the neck of YOLOv11 to enhance spatial–channel attention
before final prediction, improving its ability to detect fine and context-sensitive cracks.</p>
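          <p>To make the three axis pairs concrete, the sketch below shows only the pooling step that each branch of triplet attention starts from: the feature tensor is reduced along one axis by taking both the maximum and the mean, which leaves a two-layer map over the remaining two axes. This is an illustrative plain-Python version under stated assumptions; the helper name z_pool and the toy tensor are hypothetical, and the convolution, batch normalization, sigmoid gating, and branch averaging that follow in the full module are omitted.</p>
          <preformat><![CDATA[
```python
from itertools import product


def z_pool(tensor, axis):
    """Reduce a C x H x W nested list along `axis`; return (max_map, mean_map)."""
    shape = (len(tensor), len(tensor[0]), len(tensor[0][0]))
    kept = [d for d in range(3) if d != axis]
    rows, cols = shape[kept[0]], shape[kept[1]]
    max_map = [[0.0] * cols for _ in range(rows)]
    mean_map = [[0.0] * cols for _ in range(rows)]
    for i, j in product(range(rows), range(cols)):
        values = []
        for a in range(shape[axis]):
            idx = [0, 0, 0]
            idx[axis], idx[kept[0]], idx[kept[1]] = a, i, j
            values.append(tensor[idx[0]][idx[1]][idx[2]])
        max_map[i][j] = max(values)
        mean_map[i][j] = sum(values) / len(values)
    return max_map, mean_map


# Toy tensor with C = 2, H = 2, W = 2 (values are illustrative only).
features = [[[1.0, 2.0], [3.0, 4.0]],
            [[5.0, 6.0], [7.0, 8.0]]]

hw_max, hw_mean = z_pool(features, axis=0)  # pooled over channels: H x W map
cw_max, cw_mean = z_pool(features, axis=1)  # pooled over height:   C x W map
ch_max, ch_mean = z_pool(features, axis=2)  # pooled over width:    C x H map
print(hw_max, cw_max, ch_max)
```
]]></preformat>
          <p>Each pooled pair would then be passed through a small convolution and a sigmoid to produce the attention weights for its branch.</p>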
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. TransUNet Segmentation Model</title>
        <p>
          TransUNet is a hybrid segmentation network that integrates CNN-based encoders with
Transformer-based global attention, effectively combining local detail extraction and
long-range context modeling. It outperforms traditional U-Net models, especially in scenarios with
irregular shapes or low-contrast boundaries. These capabilities make TransUNet a strong
baseline for pixel-wise segmentation tasks such as road crack detection, where both fine-grained
localization and structural context are essential [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. The overall architecture of the TransUNet
framework is shown in Figure 5:
        </p>
        <p>The architecture adopts an encoder-decoder design, where a CNN backbone is used for
hierarchical feature extraction and a Transformer bottleneck enhances global feature representation.
The decoder integrates skip connections to refine spatial detail and recover semantic resolution.
This synergy between convolutional and self-attention mechanisms improves boundary
delineation and semantic coherence, making TransUNet particularly effective for dense prediction
tasks like road crack segmentation.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Improved TransUNet Model</title>
        <p>As shown in Figure 6, we propose an improved TransUNet model based on BiFormer Block and
Swin Transformer Block.</p>
        <p>
          While TransUNet effectively combines CNN encoders and Transformer-based bottlenecks,
it still faces challenges in preserving fine-grained spatial features, particularly in road crack
segmentation where boundary precision and structural continuity are vital. To address these
issues, we propose an enhanced TransUNet framework that incorporates Swin Transformer
Blocks [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] to capture local context through hierarchical window-based self-attention, replaces
the ViT bottleneck with a lightweight BiFormer Block [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] to model long-range dependencies
more efficiently, and introduces a Merge module combining Coordinate Attention and SENetV2
to improve feature fusion by enhancing both spatial focus and channel-wise recalibration.
        </p>
        <sec id="sec-3-4-1">
          <title>3.4.1. BiFormer Block</title>
          <p>
            BiFormer is a lightweight vision transformer that introduces a Bi-level Routing Attention (BRA)
mechanism to balance global representation and computational efficiency. As shown in Figure
7, the BiFormer Block incorporates depthwise convolution (DWConv), layer normalization (LN),
and a multilayer perceptron (MLP) alongside the BRA module. The BRA adaptively selects
query-key pairs to reduce redundant attention computation, enabling both global context
modeling and efficient feature routing. Residual connections are used throughout to preserve
gradient flow and feature continuity. Inspired by Wang et al. [
            <xref ref-type="bibr" rid="ref23">23</xref>
            ], we embed the BiFormer
Block at the bottleneck of the TransUNet architecture to capture long-range dependencies with
reduced complexity. This enhances contextual awareness and segmentation accuracy, especially
in challenging cases like elongated or disconnected road cracks.
          </p>
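          <p>The routing step of BRA can be illustrated with a small sketch. Tokens are grouped into regions, each region is summarized by the mean of its query and key vectors, and every query region keeps only the k key regions with the highest affinity; fine-grained attention is then restricted to tokens in those regions. The plain-Python code below covers the routing only, under stated assumptions: the function names, the one-dimensional grouping of tokens into regions, and the toy vectors are hypothetical, and the subsequent token-to-token attention is omitted.</p>
          <preformat><![CDATA[
```python
def region_means(tokens, region_size):
    """Average consecutive groups of `region_size` token vectors."""
    means = []
    for start in range(0, len(tokens), region_size):
        group = tokens[start:start + region_size]
        dim = len(group[0])
        means.append([sum(t[d] for t in group) / len(group) for d in range(dim)])
    return means


def route_topk(queries, keys, region_size, k):
    """For each query region, return indices of the k most related key regions."""
    q_regions = region_means(queries, region_size)
    k_regions = region_means(keys, region_size)
    routes = []
    for q in q_regions:
        scores = [sum(a * b for a, b in zip(q, kr)) for kr in k_regions]
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        routes.append(ranked[:k])
    return routes


def routed_token_indices(route, region_size):
    """Expand routed region indices into the token indices they contain."""
    return [r * region_size + i for r in route for i in range(region_size)]


# Six toy tokens of dimension 2, grouped into three regions of two tokens each.
tokens = [[3.0, 0.0], [1.0, 0.0],
          [0.0, 2.0], [0.0, 0.0],
          [1.0, 2.0], [1.0, 4.0]]

routes = route_topk(tokens, tokens, region_size=2, k=2)
gathered = routed_token_indices(routes[0], region_size=2)
print(routes, gathered)
```
]]></preformat>
          <p>With k smaller than the number of regions, each query attends to only a subset of the tokens, which is where the reduction in attention cost comes from.</p>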
        </sec>
        <sec id="sec-3-4-2">
          <title>3.4.2. Swin Transformer Block</title>
          <p>The Swin Transformer Block represents a hierarchical vision transformer architecture that
efficiently models both local and global dependencies through a window-based multi-head
self-attention mechanism. As illustrated in Figure 8, the block sequentially employs regular
window-based multi-head self-attention (W-MSA) and shifted window-based multi-head
self-attention (SW-MSA), enabling cross-window contextual interaction while preserving linear
computational complexity with respect to image size. Each attention layer is followed by a
multi-layer perceptron (MLP), with both modules encapsulated within residual connections
and layer normalization to enhance optimization stability. By incorporating shifted windows,
the architecture introduces inductive biases such as locality and translational equivariance,
which are absent in standard Transformer designs. Within our framework, the integration
of Swin Transformer Blocks significantly strengthens the encoder’s representational capacity,
particularly in modeling fine-grained structures and preserving boundary continuity critical to
tasks like road crack segmentation.</p>
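          <p>The difference between W-MSA and SW-MSA lies in how tokens are grouped before attention. The sketch below illustrates this grouping on a 4 × 4 grid of token indices with 2 × 2 windows: the regular partition, and the partition after a cyclic shift by half a window. It is a plain-Python illustration under stated assumptions; the function names and grid size are hypothetical, and the attention computation and the masking of wrapped-around positions are omitted.</p>
          <preformat><![CDATA[
```python
def window_partition(grid, ws):
    """Split an H x W grid into non-overlapping ws x ws windows (row-major order)."""
    h, w = len(grid), len(grid[0])
    windows = []
    for top in range(0, h, ws):
        for left in range(0, w, ws):
            windows.append([grid[top + dy][left + dx]
                            for dy in range(ws) for dx in range(ws)])
    return windows


def cyclic_shift(grid, shift):
    """Roll the grid up and to the left by `shift` positions with wrap-around."""
    h, w = len(grid), len(grid[0])
    return [[grid[(y + shift) % h][(x + shift) % w] for x in range(w)]
            for y in range(h)]


# 4 x 4 grid whose entries are token indices 0..15.
grid = [[r * 4 + c for c in range(4)] for r in range(4)]
window_size = 2

regular = window_partition(grid, window_size)
shifted = window_partition(cyclic_shift(grid, window_size // 2), window_size)
print(regular[0], shifted[0])
```
]]></preformat>
          <p>The first regular window groups tokens 0, 1, 4, and 5, whereas the first shifted window groups tokens 5, 6, 9, and 10, one from each of the four regular windows. This is how alternating the two partitions lets information cross window boundaries.</p>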
        </sec>
        <sec id="sec-3-4-3">
          <title>3.4.3. Merge Method</title>
          <p>
            The proposed Merge Method is a hierarchical attention-based fusion mechanism designed to
integrate multi-scale features from parallel encoder branches. It comprises two sequential
attention modules: a coordinate attention block [
            <xref ref-type="bibr" rid="ref24">24</xref>
            ] and a squeeze aggregated excitation (SaE) module
[
            <xref ref-type="bibr" rid="ref25">25</xref>
            ].
          </p>
          <p>Initially, feature maps from different encoding stages are concatenated along the channel
dimension to preserve spatial alignment. The coordinate attention block encodes directional
information via separate height- and width-wise pooling, followed by channel
transformation, thereby enhancing position-aware channel attention for improved localization of crack
boundaries.</p>
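          <p>The directional pooling in coordinate attention can be sketched for a single channel as follows. The map is averaged along its width to give one value per row and along its height to give one value per column, and the two resulting vectors gate the input. This is an illustrative plain-Python simplification: the function names and the toy feature map are hypothetical, and the shared channel transformation between pooling and gating is replaced by an identity so that only the position-aware gating remains visible.</p>
          <preformat><![CDATA[
```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def directional_pool(fmap):
    """Average an H x W map along the width (per row) and the height (per column)."""
    h, w = len(fmap), len(fmap[0])
    pooled_h = [sum(row) / w for row in fmap]
    pooled_w = [sum(fmap[y][x] for y in range(h)) / h for x in range(w)]
    return pooled_h, pooled_w


def coordinate_gate(fmap):
    """Gate each position by its row weight and column weight (identity transform)."""
    pooled_h, pooled_w = directional_pool(fmap)
    a_h = [sigmoid(v) for v in pooled_h]
    a_w = [sigmoid(v) for v in pooled_w]
    return [[fmap[y][x] * a_h[y] * a_w[x] for x in range(len(a_w))]
            for y in range(len(a_h))]


# Toy map with a vertical, crack-like response in the middle column.
fmap = [[0.0, 4.0, 0.0],
        [0.0, 4.0, 0.0],
        [0.0, 4.0, 0.0]]

pooled_h, pooled_w = directional_pool(fmap)
gated = coordinate_gate(fmap)
print(pooled_w, gated[1][1])
```
]]></preformat>
          <p>The column-wise vector peaks at the crack column, so the gate retains positional information along one axis while aggregating along the other.</p>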
          <p>Subsequently, the SaE module models inter-channel dependencies through multi-path
convolutions, capturing contextual diversity across varying receptive fields. This dual-attention
fusion boosts feature selectivity and spatial consistency, ultimately enhancing the effectiveness
of hierarchical feature aggregation for downstream segmentation tasks.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiment</title>
      <p>
        In this section, we evaluate the effectiveness of the proposed improved YOLO and TransUNet
models using two benchmark datasets: the official Ultralytics Crack Segmentation Dataset
and a reduced version of the Crack500 Dataset [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. Experimental results demonstrate that the
enhanced architectures offer superior performance in crack segmentation tasks, validating the
efficacy of the proposed modifications.
      </p>
      <sec id="sec-4-1">
        <title>4.1. Dataset</title>
        <p>The improved YOLO model is evaluated on the official Ultralytics Crack Segmentation Dataset,
which contains a single crack category and is divided into 3,717 training images, 200 validation
images, and 112 testing images. For the improved TransUNet model, we use a reduced version
of the Crack500 dataset because of the high computational cost and memory requirements of the
full dataset. The reduced dataset includes two semantic classes, crack and background, with
2,413 images for training and 603 for testing.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Implementation details</title>
        <p>All experiments were conducted on a computing platform equipped with an NVIDIA GeForce
RTX 4070 GPU, utilizing PyTorch 2.0.1 and CUDA 11.8 as the deep learning framework. The
improved YOLO model was implemented based on the Ultralytics 8.3.0 framework, with training
performed over 100 epochs using a batch size of 16, while maintaining default hyperparameter
settings. For the improved TransUNet model, the same hardware and software environment
was employed. Training followed an iteration-based strategy with 20,000 iterations, using a
batch size of 4. The Adam optimizer was adopted with an initial learning rate set to 0.0001.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Evaluation metrics</title>
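        <p>Typical metrics for object detection tasks are used to evaluate the models in this study,
including Precision, Recall, mAP50, and mAP50-95. These metrics are defined as
follows:</p>
        <disp-formula id="eq-1">
          <label>(1)</label>
          <tex-math>\mathrm{Precision} = \frac{TP}{TP + FP}</tex-math>
        </disp-formula>
        <disp-formula id="eq-2">
          <label>(2)</label>
          <tex-math>\mathrm{Recall} = \frac{TP}{TP + FN}</tex-math>
        </disp-formula>
        <disp-formula id="eq-3">
          <label>(3)</label>
          <tex-math>AP = \sum_{i=1}^{n-1} (r_{i+1} - r_i)\, P_{\mathrm{inter}}(r_{i+1})</tex-math>
        </disp-formula>
        <disp-formula id="eq-4">
          <label>(4)</label>
          <tex-math>mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i</tex-math>
        </disp-formula>
        <p>Here, TP, FP, and FN denote the numbers of true positives, false positives, and false
negatives, respectively.</p>
        <p>To further assess the crack detection capability of the improved TransUNet model, the mean
Intersection over Union (mIoU) was incorporated as an additional evaluation metric alongside
Precision and Recall. Its definition is shown below:</p>
        <disp-formula id="eq-5">
          <label>(5)</label>
          <tex-math>mIoU = \frac{1}{k+1} \sum_{i=0}^{k} \frac{TP_i}{TP_i + FP_i + FN_i}</tex-math>
        </disp-formula>
        <p>The Average Precision (AP) of a class is the area under its precision-recall
curve. <inline-formula><tex-math>r_i</tex-math></inline-formula> represents the recall of the <inline-formula><tex-math>i</tex-math></inline-formula>th value, and <inline-formula><tex-math>P_{\mathrm{inter}}(r_{i+1})</tex-math></inline-formula> represents the
highest precision value in the range <inline-formula><tex-math>r_i</tex-math></inline-formula> to <inline-formula><tex-math>r_{i+1}</tex-math></inline-formula>. The mAP is calculated by averaging the AP
of each of the <inline-formula><tex-math>N</tex-math></inline-formula> classes in the dataset. mAP50 is obtained by averaging the AP (IoU = 0.5) of all classes,
and mAP50-95 is obtained by averaging the mAPs at different IoU thresholds between 0.5 and 0.95. mIoU
measures the average overlap between the predicted and ground truth regions across all <inline-formula><tex-math>k+1</tex-math></inline-formula> classes,
where the subscript <inline-formula><tex-math>i</tex-math></inline-formula> indexes the class.</p>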
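        <p>As a worked illustration of these definitions, the following plain-Python sketch computes Precision, Recall, AP, and mIoU from counts and from a short precision-recall list. All numbers are made up for illustration and are not results from this study. The sketch also assumes the common convention that the interpolated precision at a recall level is the highest precision observed at that or any higher recall; the function names are hypothetical.</p>
        <preformat><![CDATA[
```python
def precision(tp, fp):
    """Fraction of predicted positives that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of ground-truth positives that are found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def average_precision(recalls, precisions):
    """Sum of (r_{i+1} - r_i) * P_inter(r_{i+1}) over ascending recall points."""
    ap = 0.0
    for i in range(len(recalls) - 1):
        # Assumed convention: best precision at recall >= r_{i+1}.
        p_inter = max(precisions[i + 1:])
        ap += (recalls[i + 1] - recalls[i]) * p_inter
    return ap


def mean_iou(per_class_counts):
    """Average of TP / (TP + FP + FN) over classes given (tp, fp, fn) tuples."""
    ious = [tp / (tp + fp + fn) for tp, fp, fn in per_class_counts]
    return sum(ious) / len(ious)


# Illustrative counts only.
p = precision(tp=30, fp=10)
r = recall(tp=30, fn=10)
ap = average_precision(recalls=[0.0, 0.5, 1.0], precisions=[1.0, 1.0, 0.5])
miou = mean_iou([(30, 10, 10),   # crack class
                 (80, 10, 10)])  # background class
print(p, r, ap, miou)
```
]]></preformat>
        <p>With a single crack class, mAP equals the AP of that class; mAP50-95 would repeat the AP computation at each IoU threshold from 0.5 to 0.95 and average the results.</p>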
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Improved YOLOv11 Performance Comparison</title>
        <p>To validate the effectiveness of the proposed model in crack detection, YOLOv5 and YOLOv8
are introduced as baseline comparison methods. The performance of each model on the official
Ultralytics Crack Segmentation Dataset is summarized in Table 1.</p>
        <p>Compared to YOLOv5, our model improves mAP50 by 4.35% and mAP50–95 by 6.2%,
indicating enhanced detection accuracy. It also outperforms YOLOv8 and YOLOv11 in mAP50–95,
demonstrating better localization. These results confirm the effectiveness of our architectural
improvements for crack detection.</p>
        <p>To validate the effectiveness of the proposed enhancements, ablation study results for each
individual improvement are presented in Table 2.</p>
        <p>The introduction of the Diverse Branch Block leads to a 0.6% increase in mAP50 and a 0.1% gain
in mAP50–95, highlighting its contribution to multi-scale feature enhancement. Incorporating
the TAT module further improves mAP50 by 0.8% and mAP50–95 by 3.3%, demonstrating its
effectiveness in spatial and channel attention. Their combination yields the highest performance,
validating the synergy of both modules.</p>
        <p>To intuitively demonstrate the effectiveness of the proposed improvements, a visual
comparison is conducted between the baseline YOLOv11 and the enhanced model. Representative
results are presented in Figure 9.</p>
        <p>We observe that the proposed method achieves more accurate crack localization and higher
confidence scores compared to YOLOv11. It produces fewer false detections and better captures
complete crack structures, particularly in fine or low-contrast regions, demonstrating improved
detection robustness and spatial precision.</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Improved TransUNet Performance Comparison</title>
        <p>All comparative models in this study were reproduced using the MMSegmentation 0.29.1
framework. To validate the effectiveness of the proposed model in crack segmentation, UNet
and UNet++ are introduced as baseline comparison methods. The performance of each model
on the reduced Crack500 Dataset is summarized in Table 3.</p>
        <p>Compared to TransUNet, our model improves mRecall by 17.36% and mIoU by 9.34%, indicating
enhanced segmentation completeness and region overlap. It also outperforms UNet and UNet++
in all metrics, demonstrating superior precision–recall balance. These results confirm the
effectiveness of our architectural improvements for semantic segmentation.</p>
        <p>To validate the effectiveness of the proposed enhancements, ablation study results for each
individual improvement are presented in Table 4.</p>
        <p>The integration of the BiFormer module leads to an increase of 15.74% in mRecall and 8.11%
in mIoU compared to the baseline TransUNet, highlighting its effectiveness in improving
contextual understanding. Incorporating the Swin Transformer further enhances mIoU by
0.45%, demonstrating its strength in capturing long-range dependencies. The combination of our
architectural refinements achieves the highest mRecall and mIoU, validating the complementary
benefits of both modules for segmentation accuracy.</p>
        <p>To intuitively demonstrate the effectiveness of the proposed improvements, a visual
comparison is conducted between UNet++ and the enhanced model. Representative results are
presented in Figure 10.</p>
        <p>We observe that the proposed method delivers more precise crack segmentation compared to
UNet++. It better preserves the continuity and topology of fine cracks, especially in noisy or
complex backgrounds. The results exhibit fewer broken or fragmented regions and reduced
false positives, indicating enhanced segmentation accuracy and structural consistency.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this study, we proposed improved network architectures for both road crack detection and
segmentation tasks. For object detection, we enhanced the backbone of YOLOv11 by introducing
the Diverse Branch Block and integrated the Triplet Attention Module into the neck to improve
spatial and channel attention. For semantic segmentation, we modified the original TransUNet
by incorporating Swin Transformer Blocks and a BiFormer Block, and designed a hierarchical
Merge mechanism based on coordinate and excitation attention to strengthen multi-scale feature
fusion.</p>
      <p>Experiments conducted on the Ultralytics Crack Segmentation Dataset and the reduced
Crack500 dataset demonstrate that our improved YOLOv11 model achieves superior mAP50
and mAP50–95 performance compared to YOLOv5, YOLOv8, and the original YOLOv11.
Similarly, the improved TransUNet outperforms baseline segmentation networks in terms of mIoU,
particularly in capturing fine-grained crack boundaries.</p>
      <p>In summary, our enhanced detection model improves sensitivity and robustness for
detecting small or subtle cracks, while the improved segmentation model offers superior spatial
accuracy and contextual understanding, demonstrating the effectiveness of our design for
road crack analysis tasks.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <article-title>A multi-scale information fusion framework with interaction-aware global attention for industrial vision anomaly detection and localization</article-title>
          ,
          <source>Information Fusion</source>
          <volume>124</volume>
          (
          <year>2025</year>
          )
          <fpage>103356</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <article-title>A survey of deep learning for industrial visual anomaly detection</article-title>
          ,
          <source>Artificial Intelligence Review</source>
          <volume>58</volume>
          (
          <year>2025</year>
          )
          <fpage>279</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>X.</given-names>
            <surname>Yue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <article-title>Yolo-msa: A multiscale stereoscopic attention network for empty-dish recycling robots</article-title>
          ,
          <source>IEEE Transactions on Instrumentation and Measurement</source>
          <volume>72</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>14</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T. L.</given-names>
            <surname>Mien</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. D.</given-names>
            <surname>Tu</surname>
          </string-name>
          ,
          <string-name>
            <surname>N. Van Lam</surname>
          </string-name>
          ,
          <article-title>Deploying yolov8 for real-time road crack detection on smart road length measurement devices</article-title>
          ,
          <source>Journal of Future Artificial Intelligence and Technologies</source>
          <volume>2</volume>
          (
          <year>2025</year>
          )
          <fpage>135</fpage>
          -
          <lpage>144</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Niu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <article-title>Enhancing pavement crack detection using a hybrid convolutional neural network-transformer architecture</article-title>
          ,
          <source>Transportation Research Record</source>
          (
          <year>2025</year>
          )
          <fpage>03611981251329046</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>O.</given-names>
            <surname>Isreal</surname>
          </string-name>
          ,
          <article-title>Integrating spatial and channel attention in deeplabv3 for fine-grained road crack and lane marking segmentation</article-title>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Crackadaptnet: End-to-end domain adaptation for crack detection and quantification</article-title>
          ,
          <source>Measurement</source>
          (
          <year>2025</year>
          )
          <fpage>117716</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <article-title>Dataset purification-driven lightweight deep learning model construction for empty-dish recycling robot</article-title>
          ,
          <source>IEEE Transactions on Emerging Topics in Computational Intelligence</source>
          (
          <year>2025</year>
          )
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>X.</given-names>
            <surname>Yue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <article-title>Yolo-sm: A lightweight single-class multi-deformation object detection network</article-title>
          ,
          <source>IEEE Transactions on Emerging Topics in Computational Intelligence</source>
          <volume>8</volume>
          (
          <year>2024</year>
          )
          <fpage>2467</fpage>
          -
          <lpage>2480</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ming</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Bridge crack detection algorithm designed based on yolov8</article-title>
          ,
          <source>Applied Soft Computing</source>
          <volume>149</volume>
          (
          <year>2025</year>
          )
          <fpage>110118</fpage>
          . doi:
          <pub-id pub-id-type="doi">10.1016/j.asoc.2024.110118</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>G.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>An improved yolov8-based method for concrete surface crack detection</article-title>
          ,
          <source>Nondestructive Testing and Evaluation</source>
          <volume>40</volume>
          (
          <year>2025</year>
          )
          <fpage>211</fpage>
          -
          <lpage>225</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1080/10589759.2025.2499032</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>W.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <article-title>Building construction crack detection with bccd-yolo enhanced feature fusion and attention mechanisms</article-title>
          ,
          <source>Scientific Reports</source>
          <volume>15</volume>
          (
          <year>2025</year>
          )
          <fpage>5665</fpage>
          . doi:
          <pub-id pub-id-type="doi">10.1038/s41598-025-05665-y</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>H.</given-names>
            <surname>Yoon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. K.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <article-title>Ppdd: Egocentric crack segmentation in the port pavement with deep learning-based methods</article-title>
          ,
          <source>Applied Sciences</source>
          <volume>15</volume>
          (
          <year>2025</year>
          )
          <fpage>5446</fpage>
          . doi:
          <pub-id pub-id-type="doi">10.3390/app15105446</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>C.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Etafhrnet: A transformer-based multi-scale network for asymmetric pavement crack segmentation</article-title>
          ,
          <source>Applied Sciences</source>
          <volume>15</volume>
          (
          <year>2025</year>
          )
          <fpage>6183</fpage>
          . doi:
          <pub-id pub-id-type="doi">10.3390/app15116183</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <article-title>Pixel-level efficient detection of pavement seal cracks using dlanet</article-title>
          ,
          <source>Journal of Infrastructure Systems</source>
          (
          <year>2025</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1061/JITSE4.ISENG-2537</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>R.</given-names>
            <surname>Khanam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hussain</surname>
          </string-name>
          ,
          <article-title>Yolov11: An overview of the key architectural enhancements</article-title>
          ,
          <source>arXiv preprint arXiv:2410.17725</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>An</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <article-title>Disease monitoring and characterization of feeder road network based on improved yolov11</article-title>
          ,
          <source>Electronics</source>
          <volume>14</volume>
          (
          <year>2025</year>
          )
          <fpage>1818</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>A novel yolo algorithm integrating attention mechanisms and fuzzy information for pavement crack detection</article-title>
          ,
          <source>International Journal of Computational Intelligence Systems</source>
          <volume>18</volume>
          (
          <year>2025</year>
          )
          <fpage>1</fpage>
          -
          <lpage>25</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>X.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
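          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,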
          <article-title>Diverse branch block: Building a convolution as an inception-like unit</article-title>
          ,
          in:
          <source>Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>10886</fpage>
          -
          <lpage>10895</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>D.</given-names>
            <surname>Misra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Nalamada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. U.</given-names>
            <surname>Arasanipalai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <article-title>Rotate to attend: Convolutional triplet attention module</article-title>
          ,
          in:
          <source>Proceedings of the IEEE/CVF winter conference on applications of computer vision</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>3139</fpage>
          -
          <lpage>3148</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Adeli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Yuille</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Transunet: Transformers make strong encoders for medical image segmentation</article-title>
          ,
          <source>arXiv preprint arXiv:2102.04306</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <article-title>Swin transformer: Hierarchical vision transformer using shifted windows</article-title>
          ,
          in:
          <source>Proceedings of the IEEE/CVF international conference on computer vision</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>10012</fpage>
          -
          <lpage>10022</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. W.</given-names>
            <surname>Lau</surname>
          </string-name>
          ,
          <article-title>Biformer: Vision transformer with bi-level routing attention</article-title>
          ,
          in:
          <source>Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>10323</fpage>
          -
          <lpage>10333</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <article-title>Coordinate attention for efficient mobile network design</article-title>
          ,
          in:
          <source>Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>13713</fpage>
          -
          <lpage>13722</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>M.</given-names>
            <surname>Narayanan</surname>
          </string-name>
          ,
          <article-title>Senetv2: Aggregated dense layer for channelwise and global representations</article-title>
          ,
          <source>arXiv preprint arXiv:2311.10807</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>F.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Prokhorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Mei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ling</surname>
          </string-name>
          ,
          <article-title>Feature pyramid and hierarchical boosting network for pavement crack detection</article-title>
          ,
          <source>IEEE Transactions on Intelligent Transportation Systems</source>
          <volume>21</volume>
          (
          <year>2019</year>
          )
          <fpage>1525</fpage>
          -
          <lpage>1535</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>