<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Oracle Bone Inscription Detection with Multi-Branch Feature Fusion and Efficient Attention</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jiaze Cai</string-name>
          <email>caijiaze11@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Qi Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xuexue Zhu</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bang Li</string-name>
          <email>libang@aynu.edu.cn</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xin Yan</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xia Zhang</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>College of Science and Engineering, Ritsumeikan University</institution>
          ,
          <addr-line>1-1-1 Noji-higashi, Kusatsu, Shiga, 525-8577</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Key Laboratory of Oracle Bone Inscriptions Information Processing, Anyang Normal University</institution>
          ,
          <addr-line>Anyang, 455000</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>School of Computer and Information Engineering, Anyang Normal University</institution>
          ,
          <addr-line>Anyang, 455000</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>State Key Laboratory of Information Photonics and Optical Communications, Beijing University of Posts and Telecommunications</institution>
          ,
          <addr-line>Beijing, 100876</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <fpage>150</fpage>
      <lpage>161</lpage>
      <abstract>
        <p>Oracle bone inscription (OBI) detection is hindered by tiny character size, complex morphological variation, and severe interference from surface erosion, cracks, and background noise in ancient rubbings. To overcome these challenges, we present an enhanced YOLO framework that marries multi-scale feature fusion with efficient attention, enabling robust OBI character detection. Our core innovation is the TriFusion Block (TFB), which synergistically combines three parallel branches: spatial attention for global context modeling, global modeling for semantic feature extraction, and sequential processing for efficient dependency capture. This design enables the network to simultaneously extract fine-grained local details and long-range structural patterns with minimal computational overhead. Extensive experiments on the Oracle-Bone Inscriptions Multimodal Dataset show that the proposed method improves the baseline YOLOv8n by 2.33% in recall, 8.03% in precision, and 3.96% in mAP@0.5, achieving final scores of 81.40% recall, 91.30% precision, and 86.35% mAP@0.5.</p>
      </abstract>
      <kwd-group>
        <kwd>Oracle Bone Inscriptions</kwd>
        <kwd>Multi-scale Feature Fusion</kwd>
        <kwd>YOLOv8</kwd>
        <kwd>Ancient Script Detection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Oracle bone inscriptions (OBI) are the oldest attested form of Chinese writing, carved more
than three millennia ago, mainly on turtle plastrons and animal scapulae, during the late Shang
dynasty[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. As rare physical artefacts, OBI embody a wealth of historical information and
cultural significance. Their distinctive orthography and textual content offer first-hand evidence
for tracing the evolution of Chinese characters and for exploring early Chinese social structures,
religious practices, and historical events. Consequently, the accurate detection of OBIs is not
merely a technical challenge in palaeography and archaeology, but a foundational task for
data-driven reconstructions of Shang-era civilisation.
      </p>
      <p>
        In recent years, computer vision and deep-learning techniques have advanced rapidly.
Convolutional-neural-network (CNN)[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] object-detectors—especially the YOLO (You Only
Look Once) series—have achieved notable success in image detection[
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. Integrating these
deep-learning detectors into OBI research can substantially improve detection accuracy and
efficiency while reducing manual labour, thereby accelerating the large-scale digitisation of
oracle-bone materials. These advances will support high-quality digital repositories, facilitate
scholarly decipherment, and lay a foundation for cross-disciplinary data mining and intelligent
analysis, giving the endeavour substantial theoretical and practical value.
      </p>
      <p>
        Nevertheless, OBI detection still faces substantial challenges. Most available images are
rubbings or fragmented pieces whose quality varies widely and often suffers from blur, surface
damage, and geometric distortion. Each character is a minute target easily occluded by cracks,
abrasion, and background noise; even lightweight detectors still produce numerous misses and
false positives on high-resolution inputs. Recent advances in efficient attention mechanisms[
        <xref ref-type="bibr" rid="ref5 ref6">5,
6</xref>
        ] and state-space modeling[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] have demonstrated promising capabilities for handling such
complex scenarios with reduced computational overhead, yet their application to oracle bone
character detection remains underexplored.
      </p>
      <p>This study tackles the detection of minute, heavily eroded oracle-bone characters by
introducing a multi-scale enhancement module that fuses global attention, content clustering, and
state-space modeling. The module is inserted into both the YOLOv8n backbone and the SPPF
block, enabling the network to capture global, local, and sequential cues with minimal
computational overhead. A systematic study of insertion stages and fusion weights shows that the
resulting model yields substantial accuracy gains in OBI detection, confirming the practicality
and portability of this plug-and-play, module-level paradigm for low-resource ancient-script
tasks.</p>
      <p>Overall, our main contributions can be summarized below:
• Within YOLOv8n, we introduce a TriFusion Block (TFB) that fuses global attention,
content clustering, and state-space cues in a single residual unit, enabling efficient feature
mixing with minimal overhead.
• Building on TFB, we design TriFusion-SPPF (TF-SPPF)—an enhanced
spatial-pyramid-pooling module that enlarges the receptive field via hierarchical
pooling and fuses multi-scale features with Transformer attention, enabling the
network’s upper layers to unify local detail and global context.</p>
      <p>The remainder of this article is organized as follows. In Section 2, we review related work on
oracle bone inscription detection. Section 3 details our proposed TriFusion-YOLO
framework, elaborating on the TFB and TF-SPPF as well as their attention formulations and training
losses. Section 4 presents experimental settings, evaluation metrics, and both quantitative and
qualitative analyses. Finally, Section 5 concludes the paper and discusses future directions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Oracle bone inscription detection has become a key research topic in computational
archaeology and computer vision alike. Early studies relied chiefly on conventional computer-vision
techniques—template matching, morphological processing, and graph-based reconstruction[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
However, these methods showed limited stability and accuracy when confronted with the
complex textures and damage patterns of oracle-bone rubbings.
      </p>
      <p>
        Driven by advances in deep learning, researchers have increasingly adopted neural-network
approaches for OBI detection. Zhen et al.[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] proposed an improved YOLOv8 framework that
integrates a small-object head, revised loss functions, and attention modules to boost detection
performance. Xu et al.[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] developed an intelligent detection model that couples Otsu
thresholding with a modified YOLOv8 and employs a slim neck to improve small-object detection.
The YOLO family has proved highly effective for OBI-detection tasks. Li et al.[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] proposed a
lightweight oracle-character detector built on an improved YOLOv7-tiny architecture; it
integrates partial convolution and an asymptotic feature-pyramid network, reducing computation
while preserving accuracy. Li and Du[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] built a complete pipeline that employs YOLOv8 for
character detection and ResNet-18 for classification.
      </p>
      <p>
        Beyond single-model approaches, researchers have explored multi-stage recognition
frameworks to address detection limitations. Fujikawa[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] proposed a two-model system combining
YOLOv3-tiny for initial character detection and MobileNet for secondary recognition of missed
characters, achieving 98.89% validation accuracy with significant computational efficiency.
Similarly, Meng et al.[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] developed a two-stage recognition method that first extracts skeletal
features using the Hough transform and then applies template matching with checkpoint hit
rates, demonstrating nearly 90% recognition accuracy even under character inclination and
damage conditions.
      </p>
      <sec id="sec-2-1">
        <title>2.1. Multi-scale Feature Fusion and Attention Mechanisms in OBI Detection</title>
        <p>
          Multi-scale feature fusion and attention mechanisms have been widely studied to improve
OBI-detection accuracy. Liu et al.[
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] proposed an oracle-character detection system built on
an improved YOLOv7 that adds CoordConv layers and replaces classical NMS with matrix NMS,
boosting both accuracy and inference speed. Tang et al.[
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] built an intelligent system that
employs YOLOv5 for character segmentation and ResNet-50 for classification, achieving robust
results through extensive image pre-processing and transfer-learning strategies. Addressing
the challenge of insufficient and imbalanced oracle bone datasets, Yue et al.[
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] introduced
Dynamic Data Augmentation (DDA) strategies that adaptively adjust augmentation policies
based on real-time model performance during training. Their approach achieved 8.1% accuracy
improvement over baseline Inception networks on the OBC306 dataset, demonstrating the
importance of adaptive training strategies for handling incomplete character structures and
damaged rubbings typical in oracle bone materials.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>
        In this study, we propose a YOLO architecture enhanced by modular multimodal feature fusion.
The core innovation is the TriFusion Block (TFB), which integrates spatial attention, global
modeling, and sequence-processing branches in parallel to capture local details, global structure,
and spatial dependencies. We further design TriFusion-SPPF (TF-SPPF), which augments
conventional spatial-pyramid pooling with TFB-based enhancement to strengthen multi-scale
feature representation. The modules plug seamlessly into YOLOv8n, maintaining its efficiency
while markedly improving detection accuracy, and offer strong scalability and transferability. The
overall architecture of our improved YOLOv8n detector is illustrated in Fig. 2.
      </p>
      <sec id="sec-3-0">
        <title>3.1. YOLOv8n</title>
        <p>
          YOLO, first proposed by researchers at the University of Washington in 2015, is an efficient
object-detection framework noted for its balance of speed and accuracy[
          <xref ref-type="bibr" rid="ref18 ref19">18, 19</xref>
          ]. Released by
Ultralytics in 2023[
          <xref ref-type="bibr" rid="ref20">20</xref>
          ], YOLOv8 advances the series with several architectural innovations that
noticeably improve feature extraction and detection performance.
        </p>
        <p>In this study, we adopt YOLOv8n as the baseline owing to its lightweight design, which offers
fast inference while maintaining strong detection accuracy. YOLOv8n follows a three-stage
design: the backbone extracts multi-scale features, the neck fuses them, and the head performs
localisation and classification. Relative to earlier versions, YOLOv8 introduces an anchor-free
mechanism and a decoupled head, which respectively improve localisation flexibility and
task-specific performance. In addition, YOLOv8 incorporates the Task-Aligned Assigner for sample
allocation and Distribution Focal Loss (DFL) for bounding-box regression, providing a robust
foundation for oracle-bone character detection.</p>
      </sec>
      <sec id="sec-3-1">
        <title>3.2. TriFusion Block</title>
        <p>We propose the TFB, which integrates three parallel branches—a spatial-attention branch
for global spatial modeling, a global-modeling branch for semantic-context extraction, and a
sequential-processing branch for efficient dependency modeling. By combining these
complementary pathways, TFB simultaneously captures fine-grained local features and global structural
patterns, substantially improving the feature representation for oracle-bone character detection.</p>
        <sec id="sec-3-1-1">
          <title>3.2.1. Spatial Attention Branch</title>
          <p>To explicitly model long-range pixel dependencies in space, we embed a lightweight multi-head
self-attention (MHSA) branch into the TriFusion Block. Traditional convolutional operations
are constrained by local receptive fields, whereas this spatial attention mechanism enables
direct interactions between any two spatial positions—an ability that is crucial for detecting
sparsely distributed oracle-bone characters amid complex morphological variations and surface
erosion. The branch first flattens the 2-D feature map $X \in \mathbb{R}^{B \times C \times H \times W}$ into a sequence of
length $N = H \times W$, and then applies scaled dot-product attention with 32 heads to compute
pairwise interactions. The attention-weighted features are finally reshaped back to their original
spatial size, yielding context-rich representations for subsequent fusion stages.</p>
          <p>$X \in \mathbb{R}^{B \times C \times H \times W} \xrightarrow{\ \mathrm{flatten}\ } X_{\text{flat}} \in \mathbb{R}^{B \times N \times C}$ (1)</p>
          <p>$\mathrm{Attention}_h = \mathrm{Softmax}\left(\frac{Q_h K_h^{\top}}{\sqrt{d_h}}\right) V_h$ (2)</p>
          <p>Here, $N = H \times W$ denotes the flattened sequence length, and $d_h = C/32$ represents the
dimensionality per attention head. The flatten operation transforms 2D spatial features into
1D sequences, enabling direct interactions between any two spatial positions. The scaled
dot-product attention establishes long-range spatial dependencies, which are crucial for detecting
sparsely distributed oracle-bone characters amid complex morphological variations and surface
erosion.</p>
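          <p>For concreteness, the following is a minimal PyTorch sketch of this branch. Only the flatten–attend–reshape flow and the 32-head configuration come from the description above; the class name and details such as projection biases are our assumptions.</p>
          <preformat>
import torch
import torch.nn as nn

class SpatialAttentionBranch(nn.Module):
    """Sketch of the spatial-attention branch: flatten, 32-head MHSA, reshape.

    Assumes the channel count is divisible by the number of heads.
    """

    def __init__(self, channels: int, num_heads: int = 32):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Flatten [B, C, H, W] -> [B, N, C] with N = H * W (Eq. 1).
        x_flat = x.flatten(2).transpose(1, 2)
        # Scaled dot-product self-attention over all spatial positions (Eq. 2).
        out, _ = self.attn(x_flat, x_flat, x_flat)
        # Reshape back to the original spatial layout for later fusion.
        return out.transpose(1, 2).reshape(b, c, h, w)
          </preformat>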
        </sec>
        <sec id="sec-3-1-2">
          <title>3.2.2. Global Modeling Branch</title>
          <p>To capture semantic context and long-range dependencies of oracle-bone characters, we
employ a standard Transformer encoder in the global-modeling branch. Unlike pure attention
mechanisms, this approach combines global contextual encoding with non-linear feature
transformation, which is essential for understanding semantically sparse and visually ambiguous
oracle-bone characters. The method utilizes a double-residual Transformer block design: Stage
1 applies multi-head self-attention for dependency modeling, and Stage 2 employs LayerNorm
and MLP for semantic enhancement. The input feature map [B, C, H, W] is first reshaped to [B,
HW, C] for sequence modeling, followed by residual connections and a final 1×1 convolution
for feature refinement.</p>
          <p>$Y_1 = X_{\text{flat}} + \mathrm{MHSA}(\mathrm{LayerNorm}(X_{\text{flat}}))$ (3)</p>
          <p>$Y_2 = Y_1 + \mathrm{MLP}(\mathrm{LayerNorm}(Y_1))$ (4)</p>
          <p>Here, MHSA denotes multi-head self-attention with 16 heads, and MLP represents a
feedforward network with 3× channel expansion and GELU activation. The double-residual design
ensures stable gradient flow during training while effectively capturing semantic features and
long-range contextual information essential for oracle-bone character understanding.</p>
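          <p>Under these settings (16-head MHSA, 3× MLP expansion with GELU, double residual, final 1 × 1 convolution), the branch can be sketched as follows; class and variable names are illustrative.</p>
          <preformat>
import torch
import torch.nn as nn

class GlobalModelingBranch(nn.Module):
    """Sketch of the double-residual Transformer block (Eqs. 3-4)."""

    def __init__(self, channels: int, num_heads: int = 16, mlp_ratio: int = 3):
        super().__init__()
        self.norm1 = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(channels)
        self.mlp = nn.Sequential(
            nn.Linear(channels, mlp_ratio * channels),
            nn.GELU(),
            nn.Linear(mlp_ratio * channels, channels),
        )
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        s = x.flatten(2).transpose(1, 2)    # [B, C, H, W] -> [B, HW, C]
        n = self.norm1(s)
        y1 = s + self.attn(n, n, n)[0]      # Eq. (3): pre-norm MHSA + residual
        y2 = y1 + self.mlp(self.norm2(y1))  # Eq. (4): pre-norm MLP + residual
        out = y2.transpose(1, 2).reshape(b, c, h, w)
        return self.proj(out)               # final 1x1 refinement
          </preformat>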
        </sec>
        <sec id="sec-3-1-3">
          <title>3.2.3. Sequential Processing Block</title>
          <p>To efficiently model long-range dependencies in oracle-bone character sequences at lower
computational cost, we employ a simplified state-space model (SSM) in the sequential-processing
branch. Unlike traditional attention mechanisms with O(N²) complexity, the SSM operates in
O(N) linear time, making it especially suitable for high-resolution oracle-bone rubbing images
with dense character distributions. This approach is vital for detecting visually ambiguous
and semantically sparse oracle-bone characters, where sequential relationships provide crucial
contextual cues. The method consists of two stages: Stage 1 applies SSM with LayerNorm for
efficient dependency modelling; Stage 2 applies a feed-forward network (FFN) with LayerNorm
for semantic enhancement. The input is reshaped from [B, C, H, W] to [B, HW, C] for sequence
processing, followed by residual connections and a final 1 × 1 convolution for feature refinement.</p>
          <p>$Z_1 = X_{\text{flat}} + \mathrm{SSM}(\mathrm{LayerNorm}(X_{\text{flat}}))$ (5)</p>
          <p>$Z_2 = Z_1 + \mathrm{FFN}(\mathrm{LayerNorm}(Z_1))$ (6)</p>
          <p>Here, SSM denotes a simplified state-space model with Linear-GELU-Linear structure, and
FFN represents a feed-forward network with 2× channel expansion. Unlike traditional attention
mechanisms with O(N²) complexity, the SSM operates in O(N) linear time, making it
particularly efficient for processing high-resolution oracle-bone rubbing images with dense character
distributions.</p>
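          <p>A sketch of the two-stage computation in Eqs. (5)–(6) follows, treating the simplified SSM as the Linear-GELU-Linear token mixer described above; hidden sizes beyond the stated 2× FFN expansion, and all names, are our assumptions.</p>
          <preformat>
import torch
import torch.nn as nn

class SequentialProcessingBranch(nn.Module):
    """Sketch of the sequential branch: simplified SSM + 2x-expansion FFN."""

    def __init__(self, channels: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(channels)
        self.ssm = nn.Sequential(              # Linear-GELU-Linear mixer, O(N)
            nn.Linear(channels, channels),
            nn.GELU(),
            nn.Linear(channels, channels),
        )
        self.norm2 = nn.LayerNorm(channels)
        self.ffn = nn.Sequential(              # feed-forward, 2x expansion
            nn.Linear(channels, 2 * channels),
            nn.GELU(),
            nn.Linear(2 * channels, channels),
        )
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        s = x.flatten(2).transpose(1, 2)       # [B, C, H, W] -> [B, HW, C]
        z1 = s + self.ssm(self.norm1(s))       # Eq. (5)
        z2 = z1 + self.ffn(self.norm2(z1))     # Eq. (6)
        return self.proj(z2.transpose(1, 2).reshape(b, c, h, w))
          </preformat>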
        </sec>
        <sec id="sec-3-1-4">
          <title>3.2.4. Adaptive Multimodal Feature Fusion</title>
          <p>To integrate the complementary information from the three parallel branches, we propose
an adaptive multimodal fusion module. The input feature map $X$ is fed concurrently to the
spatial-attention, global-modeling, and sequential branches, producing three feature tensors.
Learnable scalar weights adaptively aggregate these tensors, which are then concatenated
along the channel dimension. A lightweight Conv–BatchNorm–GELU block further blends
the concatenated features, after which a channel-attention unit—global average pooling, two
1 × 1 convolutions, and a sigmoid—re-calibrates the activations. Finally, a residual shortcut
adds the fused features to the original input, delivering adaptive integration with negligible
computational overhead.</p>
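          <p>A sketch of this fusion step under the description above is given below; the reduction ratio inside the channel-attention unit is our assumption, since the text specifies only its global-average-pooling, two 1 × 1 convolutions, and sigmoid structure.</p>
          <preformat>
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Sketch of the adaptive multimodal fusion with a residual shortcut."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.branch_weights = nn.Parameter(torch.ones(3))  # one scalar per branch
        self.blend = nn.Sequential(                        # Conv-BN-GELU mixer
            nn.Conv2d(3 * channels, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
            nn.GELU(),
        )
        self.channel_attn = nn.Sequential(                 # GAP + two 1x1 convs + sigmoid
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x, f_spatial, f_global, f_seq):
        w = self.branch_weights
        # Learnable scalar weights, then channel-wise concatenation.
        fused = torch.cat([w[0] * f_spatial, w[1] * f_global, w[2] * f_seq], dim=1)
        fused = self.blend(fused)
        fused = fused * self.channel_attn(fused)  # channel re-calibration
        return x + fused                          # residual shortcut
          </preformat>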
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.3. TF-SPPF Enhancement Design</title>
        <p>To boost feature representation without sacrificing the original multi-scale capability, we design
the TriFusion-SPPF (TF-SPPF) module as a non-intrusive wrapper around the standard SPPF.
TF-SPPF preserves the cascaded triple 5 × 5 max-pooling stages and the multi-scale fusion
of the original SPPF. After SPPF, the output features flow into the Combined-Enhancement
module for multimodal refinement, which adaptively fuses three parallel branches—spatial
attention, global modeling, and sequential processing—to enrich semantic representation. An
adjustment layer of 1 × 1 convolution, BatchNorm, and SiLU activation further refines the
fused features. To improve feature quality, a channel re-calibration block applies global average
pooling, an eight-fold channel reduction, SiLU activation, expansion, and Sigmoid gating to
produce channel-attention weights. Finally, a conditional residual shortcut is used when the
input and output shapes match; otherwise, the recalibrated features are forwarded directly.
This design retains the efficient multi-scale modeling of SPPF while markedly boosting feature
discriminability through combined attention and re-calibration mechanisms.</p>
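        <p>The wrapper structure can be sketched as follows; the SPPF and TriFusion enhancement modules are passed in unchanged rather than re-implemented, and the class and attribute names are ours.</p>
        <preformat>
import torch
import torch.nn as nn

class TFSPPF(nn.Module):
    """Sketch of TF-SPPF: SPPF output -> enhancement -> adjust -> re-calibrate."""

    def __init__(self, sppf: nn.Module, enhance: nn.Module, channels: int):
        super().__init__()
        self.sppf = sppf        # standard SPPF (cascaded 5x5 max-pooling)
        self.enhance = enhance  # Combined-Enhancement (TriFusion) module
        self.adjust = nn.Sequential(             # 1x1 Conv + BN + SiLU
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
        )
        self.recalibrate = nn.Sequential(        # GAP, 8x reduce, expand, gate
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 8, kernel_size=1),
            nn.SiLU(),
            nn.Conv2d(channels // 8, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.sppf(x)                         # multi-scale pooling as usual
        z = self.adjust(self.enhance(y))         # multimodal refinement
        z = z * self.recalibrate(z)              # channel re-calibration
        # Conditional residual shortcut: only when shapes match.
        return z + y if z.shape == y.shape else z
        </preformat>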
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiment</title>
      <p>In this section, we conduct comprehensive experiments to evaluate the effectiveness of our
proposed TriFusion-YOLO framework for oracle bone character detection. To demonstrate the
superiority of our method, we perform extensive evaluations on the Oracle-Bone Inscriptions
Multimodal Dataset, comparing our approach against state-of-the-art object detection
models including the baseline YOLOv8n and the latest YOLOv11n. Our experimental evaluation
encompasses both quantitative and qualitative analyses, examining detection accuracy,
computational efficiency, and visual performance across diverse oracle bone rubbing scenarios. We
systematically investigate the contribution of each component through detailed ablation studies,
analyzing the individual and combined effects of the spatial attention branch, global modeling
branch, and sequential processing branch within the TriFusion Block. Additionally, we provide
thorough implementation details, evaluation metrics, and performance comparisons to ensure
reproducibility and fair assessment of our proposed method.</p>
      <sec id="sec-4-1">
        <title>4.1. Dataset</title>
        <p>
          We use the rubbing subset of the Oracle-Bone Inscriptions Multimodal Dataset (OBIMD)[
          <xref ref-type="bibr" rid="ref21">21</xref>
          ],
which comprises 10,077 high-quality rubbing images sampled from five historical phases of
Yinxu. Each image is professionally annotated by domain experts with bounding boxes and
category labels for every character, fully reflecting real-world challenges in oracle-bone
detection—character diversity, scale variation, and complex backgrounds. The annotation workflow
combines AI-assisted pre-labelling with expert verification to ensure high data quality and
accuracy. Overall, the dataset captures the diversity and complexity of real-world oracle-bone
detection scenarios.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Implementation details</title>
        <p>All experiments were run on a workstation equipped with an NVIDIA GeForce RTX 4070
GPU (12 GB). The experiments used PyTorch 2.6.0 with CUDA 11.8. The proposed method
was implemented with Ultralytics 8.3.114, the official framework for YOLOv8n. OpenCV 4.11.0
handled image pre-processing.</p>
        <p>Models were trained for 200 epochs with a batch size of 32. Input images were resized to
640 × 640. Optimisation used AdamW (initial LR = 0.001; final-LR factor = 0.01, i.e. the learning
rate decays to 0.001 × 0.01; momentum = 0.937; weight decay = 1 × 10⁻⁴). The learning-rate
schedule followed a warm-up cosine-annealing pattern: the LR increases during the first five
epochs and then decays following a cosine curve. Early stopping (patience = 50 epochs) prevented
over-fitting. Mixed-precision training (AMP) was enabled to improve efficiency and memory
usage, and the number of data-loading workers was set to 8.</p>
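        <p>For reference, this setup maps onto the Ultralytics training API roughly as follows; the dataset YAML name is a placeholder, and lrf is interpreted as the final-LR factor of the cosine schedule.</p>
        <preformat>
from ultralytics import YOLO

# Baseline config; the TriFusion variant would substitute a custom model YAML.
model = YOLO("yolov8n.yaml")
model.train(
    data="obimd.yaml",   # hypothetical dataset config for OBIMD
    epochs=200,
    batch=32,
    imgsz=640,
    optimizer="AdamW",
    lr0=0.001,           # initial learning rate
    lrf=0.01,            # final LR = lr0 * lrf
    cos_lr=True,         # cosine-annealing schedule
    momentum=0.937,
    weight_decay=1e-4,
    warmup_epochs=5,
    patience=50,         # early stopping
    amp=True,            # mixed-precision training
    workers=8,
)
        </preformat>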
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Evaluation Metrics</title>
        <p>The model’s performance was evaluated using three critical metrics: Precision, Recall, and
mAP@0.5. These metrics provided key insights into the model’s detection accuracy, recall
capability, and overall performance on the validation set. Precision assessed the proportion of
correct predictions among all positive predictions, Recall measured the ability to detect all
relevant characters, and mAP@0.5 offered a comprehensive measure of detection effectiveness.
The metrics are defined as follows:</p>
        <p>$\mathrm{Precision} = \dfrac{\mathrm{TruePositives}}{\mathrm{TruePositives} + \mathrm{FalsePositives}}$ (7)</p>
        <p>$\mathrm{Recall} = \dfrac{\mathrm{TruePositives}}{\mathrm{TruePositives} + \mathrm{FalseNegatives}}$ (8)</p>
        <p>For each class, Average Precision (AP) is defined as the area under the precision–recall
curve, where precision is plotted against recall from 0 to 1. The mean Average Precision at an
IoU threshold of 0.5 (mAP@0.5) equals the average AP over all categories in the dataset. This
metric jointly evaluates precision and recall, capturing the model’s ability to detect and classify
oracle-bone characters while limiting false positives; it is therefore well suited to complex
archaeological document-analysis tasks.</p>
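        <p>As a sanity check, these definitions translate directly into code; the AP helper below uses a simple trapezoidal approximation of the area under the precision–recall curve, whereas detection toolkits usually smooth the precision envelope first, so values may differ slightly.</p>
        <preformat>
def precision(tp: int, fp: int) -> float:
    """Eq. (7): fraction of predicted boxes that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fn: int) -> float:
    """Eq. (8): fraction of ground-truth characters that are found."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def average_precision(precisions: list, recalls: list) -> float:
    """Area under the PR curve; mAP@0.5 averages this over all classes."""
    ap = 0.0
    for i in range(1, len(recalls)):
        ap += (recalls[i] - recalls[i - 1]) * (precisions[i] + precisions[i - 1]) / 2
    return ap
        </preformat>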
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Performance Comparison</title>
        <p>To assess the effectiveness of the proposed TriFusion-YOLO model, we systematically
compare its performance with the baseline YOLOv8n and the latest YOLOv11n on an oracle-bone
character dataset. All models are trained under identical settings—including dataset splits,
hyper-parameters, and evaluation metrics—to ensure a fair comparison. Performance is
evaluated using Precision, Recall, and mAP@0.5, with special attention to detection accuracy and the
reduction of false positives in complex archaeological documents.</p>
        <p>Table 1 compares the performance of diferent models on the oracle-bone character detection
dataset. The baseline YOLOv8n achieves a recall of 79.07%, precision of 83.27%, and mAP@0.5
of 82.39%. YOLOv11n shows slight improvements, with a recall of 79.33%, precision of 87.64%,
and mAP@0.5 of 83.93%. In contrast, the TriFusion-YOLO model significantly outperforms both
baseline methods, attaining a recall of 81.40%, precision of 91.30%, and mAP@0.5 of 86.35%.
Notably, our method demonstrates substantial gains over YOLOv8n, with recall improved by
2.33%, precision enhanced by 8.03%, and mAP@0.5 increased by 3.96%. These results confirm the
effectiveness of the TriFusion Block and TF-SPPF modules in oracle-bone character detection,
especially in reducing false positives while maintaining high detection accuracy.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this work, we propose the TriFusion Block (TFB), a multi-branch fusion module that
integrates spatial attention, global modeling, and sequential processing to improve oracle-bone
character detection, especially in complex and severely degraded rubbing images.</p>
      <p>Integrated into the YOLOv8n backbone, our method leverages complementary capabilities
from each branch to enhance both localization and classification performance. Extensive
experiments on the Oracle-Bone Inscriptions Multimodal Dataset show that our approach
achieves a recall of 81.40%, precision of 91.30%, and mAP@0.5 of 86.35%, surpassing the baseline
model by 2.33, 8.03, and 3.96 percentage points, respectively. Ablation studies confirm that each
component of TFB contributes positively, validating the effectiveness and generalizability of
our multi-branch fusion strategy.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This research was supported by the Natural Science Foundation of China (Grant No. 62506007),
the Natural Science Foundation of Henan Province (Grant No. 242300420680), the
Paleography and Chinese Civilization Inheritance and Development Program (Grant Nos. G1807,
G1806, G2821), the Henan Province Science and Technology Research Project (Grant Nos.
242102210116, 252102321071), Major Science and Technology Project of Anyang (Grant No.
2025A02SF007) and the Henan Province High-Level Talents International Training Program
(Grant No. GCC2025028).</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.-L. Liu</surname>
          </string-name>
          ,
          <article-title>A comprehensive survey of oracle character recognition: Challenges, benchmarks, and beyond</article-title>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Alzubaidi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Humaidi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Al-Dujaili</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Duan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Al-Shamma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Santamaría</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Fadhel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Al-Amidie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Farhan</surname>
          </string-name>
          ,
          <article-title>Review of deep learning: concepts, cnn architectures, challenges, applications, future directions</article-title>
          ,
          <source>Journal of big Data</source>
          <volume>8</volume>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>74</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Redmon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Divvala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farhadi</surname>
          </string-name>
          ,
          <article-title>You only look once: Unified, real-time object detection</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>779</fpage>
          -
          <lpage>788</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bochkovskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-Y. M.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <article-title>Yolov4: Optimal speed and accuracy of object detection</article-title>
          ,
          <source>arXiv preprint arXiv:2004.10934</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. D.</given-names>
            <surname>Haeffele</surname>
          </string-name>
          ,
          <article-title>Token statistics transformer: Linear-time attention via variational rate reduction</article-title>
          ,
          <source>in: The Thirteenth International Conference on Learning Representations</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Catanet: Efficient content-aware token aggregation for lightweight image super-resolution</article-title>
          ,
          <source>in: Proceedings of the Computer Vision and Pattern Recognition Conference</source>
          ,
          <year>2025</year>
          , pp.
          <fpage>17902</fpage>
          -
          <lpage>17912</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>L.</given-names>
            <surname>Kong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <article-title>Efficient visual state space model for image deblurring</article-title>
          ,
          <source>in: Proceedings of the Computer Vision and Pattern Recognition Conference</source>
          ,
          <year>2025</year>
          , pp.
          <fpage>12710</fpage>
          -
          <lpage>12719</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <article-title>Identification of oracle-bone script fonts based on topological registration</article-title>
          ,
          <source>Computer &amp; Digital Engineering</source>
          <volume>10</volume>
          (
          <year>2016</year>
          )
          <fpage>029</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>An oracle bone inscriptions detection algorithm based on improved yolov8</article-title>
          ,
          <source>Algorithms</source>
          <volume>17</volume>
          (
          <year>2024</year>
          )
          <fpage>174</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Research on the intelligent oracle bone script recognition model based on otsu and improved yolov8</article-title>
          ,
          <source>in: 2024 IEEE 4th International Conference on Electronic Technology, Communication and Information (ICETCI)</source>
          , IEEE,
          <year>2024</year>
          , pp.
          <fpage>972</fpage>
          -
          <lpage>976</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Lightweight oracle bone character detection algorithm based on improved yolov7-tiny</article-title>
          , in: 2024
          <source>IEEE International Conference on Mechatronics and Automation (ICMA)</source>
          , IEEE,
          <year>2024</year>
          , pp.
          <fpage>485</fpage>
          -
          <lpage>490</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <article-title>Research on oracle bone inscription segmentation and recognition model based on deep learning</article-title>
          ,
          <source>in: 2024 IEEE 4th International Conference on Electronic Technology, Communication and Information (ICETCI)</source>
          , IEEE,
          <year>2024</year>
          , pp.
          <fpage>1309</fpage>
          -
          <lpage>1314</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fujikawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Aravinda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. A.</given-names>
            <surname>Prabhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <article-title>Recognition of oracle bone inscriptions by using two deep learning models</article-title>
          ,
          <source>International Journal of Digital Humanities</source>
          <volume>5</volume>
          (
          <year>2023</year>
          )
          <fpage>65</fpage>
          -
          <lpage>79</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>L.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <article-title>Two-stage recognition for oracle bone inscriptions</article-title>
          ,
          <source>in: International conference on image analysis and processing</source>
          , Springer,
          <year>2017</year>
          , pp.
          <fpage>672</fpage>
          -
          <lpage>682</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Oracle recognition based on improved yolov7</article-title>
          ,
          <source>in: International Conference on Computer Graphics, Artificial Intelligence, and Data Processing (ICCAID</source>
          <year>2023</year>
          ), volume
          <volume>13105</volume>
          ,
          SPIE
          ,
          <year>2024</year>
          , pp.
          <fpage>521</fpage>
          -
          <lpage>527</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>H.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Oracle bone script intelligent recognition: Automatic segmentation and recognition of original rubbing single characters</article-title>
          ,
          <source>in: 2024 5th International Conference on Electronic Communication and Artificial Intelligence (ICECAI)</source>
          , IEEE,
          <year>2024</year>
          , pp.
          <fpage>414</fpage>
          -
          <lpage>418</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>X.</given-names>
            <surname>Yue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fujikawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <article-title>Dynamic dataset augmentation for deep learning-based oracle bone inscriptions recognition</article-title>
          ,
          <source>ACM Journal on Computing and Cultural Heritage</source>
          <volume>15</volume>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>20</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>P.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ergu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <article-title>A review of yolo algorithm developments</article-title>
          ,
          <source>Procedia computer science 199</source>
          (
          <year>2022</year>
          )
          <fpage>1066</fpage>
          -
          <lpage>1073</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>J.</given-names>
            <surname>Terven</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.-M.</given-names>
            <surname>Córdova-Esparza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-A.</given-names>
            <surname>Romero-González</surname>
          </string-name>
          ,
          <article-title>A comprehensive review of yolo architectures in computer vision: From yolov1 to yolov8 and yolo-nas</article-title>
          ,
          <source>Machine Learning and Knowledge Extraction 5</source>
          (
          <year>2023</year>
          )
          <fpage>1680</fpage>
          -
          <lpage>1716</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sohan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. Sai</given-names>
            <surname>Ram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. V.</given-names>
            <surname>Rami Reddy</surname>
          </string-name>
          ,
          <article-title>A review on yolov8 and its advancements</article-title>
          ,
          <source>in: International Conference on Data Intelligence and Cognitive Informatics</source>
          , Springer,
          <year>2024</year>
          , pp.
          <fpage>529</fpage>
          -
          <lpage>545</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Qin</surname>
          </string-name>
          , et al.,
          <article-title>Oracle bone inscriptions multi-modal dataset</article-title>
          ,
          <source>arXiv preprint arXiv:2407.03900</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>