<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Radical-Conditioned Diffusion Model for Oracle Bone Character Generation and Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zengmao Ding</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaoping He</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiao Li</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Qi Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xin Yan</string-name>
          <email>xyan@bupt.edu.cn</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xia Zhang</string-name>
          <email>xzhang@bupt.edu.cn</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bang Li</string-name>
          <email>libang@aynu.edu.cn</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>College of Science and Engineering, Ritsumeikan University</institution>
          ,
          <addr-line>1-1-1 Noji-higashi, Kusatsu, Shiga, 525-8577</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Key Laboratory of Oracle Bone Inscriptions Information Processing, Anyang Normal University</institution>
          ,
          <addr-line>Anyang</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>School of Computer &amp; Information Engineering, Anyang Normal University</institution>
          ,
          <addr-line>Anyang</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>State Laboratory of Information Photonics and Optical Communications, Beijing University of Posts and Telecommunications</institution>
          ,
          <addr-line>Beijing 100876</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <fpage>39</fpage>
      <lpage>50</lpage>
      <abstract>
        <p>Oracle bone inscriptions (OBI), the earliest form of Chinese writing, are composed of complex characters built from recurring radical components. Understanding how these components form full characters is critical to studying the semantics and structure of early writing systems. In this paper, we propose a radical-conditioned diffusion model that synthesizes plausible OBI characters given a set of radicals and their counts. Our method encodes radical identity and positional context into structured embeddings, which condition the generation process via cross-attention in a U-Net backbone. To better preserve radical morphology and visual coherence, we introduce a perceptual loss that adapts dynamically during denoising. Experiments show that our model not only generates visually consistent and structurally valid characters, but also improves multi-label classification when used for data augmentation. These results demonstrate the potential of component-level generation as a tool for character reconstruction and structural analysis in ancient scripts.</p>
      </abstract>
      <kwd-group>
        <kwd>Oracle Bone Inscriptions</kwd>
        <kwd>Multi-instance Image Generation</kwd>
        <kwd>DDPM</kwd>
        <kwd>Deep Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        However, it is important to emphasize that as an early stage in the development of Chinese
characters, OBI exhibits a certain regularity in its radical structure, but the forms of its
components are not absolutely fixed [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. A key observation is that the same radical or component
often displays subtle yet significant morphological variations across different OBI character
forms. These differences are not arbitrary scribbles [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]; they frequently represent adaptive
adjustments made by the scribe. These adjustments aimed to better integrate the component
into the specific meaning or overall structure of the character it was part of, or were constrained
by factors like the writing implement and spatial layout [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. These nuanced morphological
variations contain rich information about character construction, semantic associations, and
writing conventions. They constitute a valuable window for researching the character-creation
mindset and evolutionary patterns of OBI, holding extremely high value for systematic study.
      </p>
      <p>[Figure 1: Oracle bone characters (OBC) shown alongside their radical annotations.]</p>
      <p>
        To delve deeper into these characteristics of OBI radicals and their role in character formation,
we model the relationship between a radical in an oracle bone character and its instances in
different characters. As shown in Figure 1, the relationships between different radicals are not
simply linear combinations, but involve more complex structural compositions. This task presents
certain challenges, manifested primarily in data scarcity and in designing relational modeling
specifically tailored to the aforementioned characteristics. To address these challenges, we
utilize radical data from the YinQiWenYuan [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] database and leverage the powerful associative
generative capabilities of diffusion models. We train a radical-guided diffusion model for OBI
based on this data. The core idea is to input specific radical set information as a condition to
guide the model in generating target oracle bone characters containing that radical set. During
this process, we observe that the model spontaneously evolves a bias for different radicals
under this training paradigm. This bias is manifested as the model spontaneously capturing and
preserving the subtle morphological differences of radicals when they participate in forming
different characters (as mentioned above), as well as summarizing the positional regularity of
radicals within OBI characters. This bias exhibited during the generation process provides us
with a novel perspective to analyze the morphological characteristics and variation patterns of OBI
radicals, and their deep associations with character meaning and structure.
      </p>
      <p>In addition to structural analysis, radical-conditioned character generation provides a practical
benefit in low-resource settings. Given the scarcity and imbalance of radical annotations in
existing OBI datasets, we hypothesize that our model can generate structurally valid yet diverse
character forms to augment the training data. These synthetic samples may help improve
downstream classification tasks, particularly in the presence of rare radical combinations. We
validate this hypothesis through a controlled data augmentation experiment in Section 3.3.</p>
      <p>Overall, our main contributions can be summarized below:
• We propose a new generative task that maps oracle radicals to full characters, framing
character construction as a component-conditioned generation process.
• We show that the generation behavior of a diffusion model inherently reflects structural
biases, allowing us to analyze radical similarity, spatial regularity, and semantic stability
through its learned latent space.</p>
      <p>The remainder of this article is organized as follows. In Section 1, we review related work on
oracle bone recognition and multi-instance image generation. Section 2 details our proposed
radical-conditioned diffusion framework, including the embedding design and loss functions.
Section 3 presents experimental settings, evaluation metrics, and both quantitative and
qualitative analyses. Finally, Section 4 concludes the paper and discusses future directions.</p>
    </sec>
    <sec id="sec-2">
      <title>1. Related Work</title>
      <sec id="sec-2-1">
        <title>1.1. OBI Recognition</title>
        <p>
          The recognition of OBI aims to classify characters in hand-written or authentic OBI images.
Recent advancements [
          <xref ref-type="bibr" rid="ref10 ref7 ref8 ref9">7, 8, 9, 10</xref>
          ] in Oracle Bone Inscription (OBI) recognition, particularly
to address challenges in recognizing complete characters, have highlighted the potential of
component-level analysis. For instance, frameworks like the Oracle Bone Inscription Component
Analysis proposed by Zhao et al. [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] utilize image similarity metrics to extract and compare
radical components, revealing their structural roles across different OBI characters. Similarly,
studies on character evolution, such as those employing few-shot learning to trace morphological
changes from OBI to modern scripts, demonstrate that radicals undergo simplification, merging,
and stroke variations to adapt to diverse character compositions [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. These findings underscore
the necessity of modeling radical-specific variations to achieve robust character generation.
To support such modeling, large-scale datasets like HUST-OBC [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] provide a rich resource,
containing 77,064 deciphered and 62,989 undeciphered character images, many of which exhibit
significant radical variations due to evolving writing styles.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>1.2. Multi-instance Image Generation</title>
        <p>Multi-instance image generation (MIIG) focuses on synthesizing complex scenes containing
multiple objects with precise spatial relationships and instance-specific attributes.</p>
        <p>[Figure 2: Overview of the Radical-Conditioned Diffusion framework. Radical labels and count labels
are mapped to class and count embeddings, concatenated, and projected by an MLP; the resulting
conditioning vectors guide the U-Net (down blocks, mid block, up blocks) through cross-attention,
producing per-radical attention maps.]</p>
        <p>
          Early text-to-image models struggled with compositional consistency, leading to innovations in instance-level
control [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. InstanceDiffusion [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] pioneered unified instance conditioning via its UniFusion
module, supporting flexible location inputs (points, masks, boxes) and per-instance textual
descriptions. Large-scale text-to-image diffusion models like Stable Diffusion [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], GLIDE [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ],
Imagen [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], and DALL·E 2 [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] generate rich multi-instance scenes from compositional text prompts.
        </p>
        <p>These works collectively address core MIIG challenges: unifying diverse instance conditions,
mitigating inter-instance interference, and scaling relational reasoning. However, handling
overlapping instances and abstract spatial instructions remains challenging. Our radical-conditioned
diffusion task is particularly demanding due to the non-fixed morphology of radicals across
different characters and the need to model complex spatial and compositional relationships.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>2. Methodology</title>
      <p>
        Our core objective is to develop a generative model capable of synthesizing OBI
characters under the guidance of specified radical components. Formally, we are given a set of
radicals ℛ = {r1, r2, ..., rK} and their corresponding frequencies within the target character
𝒞 = {c1, c2, ..., cK}, where rk denotes the type of radical and ck the number of times it
appears. The frequency ck of each radical is included because repetition is semantically and
structurally meaningful in oracle characters. For example, repeating a radical may imply intensity or
plurality, and some characters are explicitly formed by duplicating a component. Capturing
such information enables the model to generate character structures that more accurately reflect
historical compositional rules. Our model aims to generate a plausible OBI character image x0
that incorporates the specified radicals with their respective frequencies. The overview of the
Radical-Conditioned Diffusion framework is shown in Figure 2.
2.1. Conditioned Generation via Radical-Guided Diffusion
We employ a Denoising Diffusion Probabilistic Model [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] as the backbone generative framework.
The key innovation lies in how we effectively condition the diffusion process on the specified
radical set (ℛ, 𝒞).
      </p>
      <p>Radical Representation with Positional Context: To encode both the radical type and
its positional information within the target character’s composition, we design a specialized
embedding module. For each radical rk appearing ck times in the character, we generate
two embeddings: a type embedding e_type(k) ∈ R^d representing the semantic category of radical
rk, and a positional embedding e_pos(j) ∈ R^d representing the sequential order j (where j = 1, ..., ck) of
occurrence for that radical type within the character. These embeddings are concatenated as
[e_type(k); e_pos(j)] and projected into a unified conditioning vector e(k,j) ∈ R^d via a non-linear transformation:
e(k,j) = Proj([e_type(k); e_pos(j)])
where Proj(·) denotes a projection function implemented by a multi-layer perceptron (MLP)
with a GELU activation. The complete conditioning signal for the diffusion model is the set
of all such vectors {e(k,j)} for all radical types rk ∈ ℛ and their j = 1, ..., ck occurrences. This
structured embedding explicitly informs the model what radicals are needed and in what
sequence they are expected to appear, capturing potential positional biases observed in oracle
bone script composition.</p>
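      <p>As an illustration, the embedding pipeline described above can be sketched in plain Python. This is a minimal sketch, not the paper's implementation: the dimension D, the random lookup tables, the single linear layer standing in for the MLP Proj(·), and the tanh approximation of GELU are all assumptions made for the example.</p>

```python
import math
import random

random.seed(0)

D = 8                # embedding dimension d (assumed)
NUM_RADICALS = 185   # radical vocabulary size, as in the dataset
MAX_COUNT = 5        # maximum occurrences of one radical (assumed)

# Lookup tables for type and positional embeddings (random stand-ins).
type_table = [[random.gauss(0, 1) for _ in range(D)] for _ in range(NUM_RADICALS)]
pos_table = [[random.gauss(0, 1) for _ in range(D)] for _ in range(MAX_COUNT)]

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1 + math.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x ** 3)))

# Proj(.): a single linear layer + GELU mapping R^{2d} to R^d
# (a stand-in for the paper's MLP).
W = [[random.gauss(0, 0.1) for _ in range(2 * D)] for _ in range(D)]

def proj(vec):
    return [gelu(sum(w * v for w, v in zip(row, vec))) for row in W]

def conditioning_set(radicals, counts):
    """Build {e(k,j)}: one conditioning vector per (radical type k, occurrence j)."""
    vectors = []
    for k, c_k in zip(radicals, counts):
        for j in range(c_k):
            concat = type_table[k] + pos_table[j]  # [e_type(k) ; e_pos(j)]
            vectors.append(proj(concat))
    return vectors

# A character built from radical 3 (appearing twice) and radical 17 (once)
# yields three conditioning vectors of dimension D.
cond = conditioning_set([3, 17], [2, 1])
```

      <p>Note that the two occurrences of radical 3 receive different vectors because their positional embeddings differ, which is what lets the model distinguish repeated components.</p>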
      <p>Integration into Diffusion: These conditioning vectors {e(k,j)} are integrated into the
diffusion model’s U-Net backbone using cross-attention layers. At each denoising step t, the
intermediate features of the U-Net decoder attend to the conditioning embeddings, allowing the
generation process to be dynamically guided by the specified radical composition throughout
the diffusion trajectory.</p>
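      <p>The cross-attention mechanism can be sketched as follows. This is an illustrative single-head version in plain Python: the learned query/key/value projections and multi-head structure of a real U-Net cross-attention layer are omitted, and all tensor shapes are toy choices.</p>

```python
import math
import random

random.seed(1)

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(queries, cond_vectors):
    """Single-head cross-attention: each spatial U-Net feature (query)
    attends over the radical conditioning embeddings {e(k,j)}, which act
    here as both keys and values. Learned Q/K/V projections are omitted."""
    d = len(cond_vectors[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in cond_vectors]
        weights = softmax(scores)            # one weight per radical occurrence
        out.append([sum(w * v[i] for w, v in zip(weights, cond_vectors))
                    for i in range(d)])      # convex combination of the values
    return out

D = 4
queries = [[random.gauss(0, 1) for _ in range(D)] for _ in range(6)]  # 6 spatial positions
cond = [[random.gauss(0, 1) for _ in range(D)] for _ in range(3)]     # 3 radical embeddings
attended = cross_attention(queries, cond)
```

      <p>Because the softmax weights sum to one, every attended feature is a convex combination of the conditioning vectors, so the radical embeddings directly steer the intermediate features at each denoising step.</p>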
      <p>The standard training objective for the diffusion model is to minimize the noise prediction
error. Specifically, during the forward process of diffusion, the clean OBI character image x0 is
progressively corrupted by adding Gaussian noise at timestep t, resulting in x_t. The model ε_θ
aims to predict the noise ε added to x0. This standard diffusion loss is defined as:
ℒ_noise = E_{x0, ε∼𝒩(0,I), t}[ ‖ε − ε_θ(x_t, t | {e(k,j)})‖²₂ ].</p>
      <p>Here, ε_θ(x_t, t | {e(k,j)}) denotes our conditional diffusion model, which takes the noisy image
x_t, the timestep t, and the radical conditioning embeddings {e(k,j)} as input to predict the noise ε.</p>
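      <p>One training step of this noise-prediction objective can be sketched on a toy flattened "image". This is a sketch only: the linear beta schedule, the number of timesteps T, and the trivial all-zero predictor are assumptions, not values from the paper.</p>

```python
import math
import random

random.seed(2)

T = 1000
# Linear beta schedule (a common DDPM choice; assumed, not from the paper).
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
alpha_bar = []
prod = 1.0
for b in betas:
    prod *= 1.0 - b
    alpha_bar.append(prod)  # cumulative product of (1 - beta_t)

def forward_noise(x0, t):
    """q(x_t | x_0): x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    eps = [random.gauss(0, 1) for _ in x0]
    a = alpha_bar[t]
    xt = [math.sqrt(a) * x + math.sqrt(1 - a) * e for x, e in zip(x0, eps)]
    return xt, eps

def noise_loss(eps, eps_pred):
    """L_noise = mean squared error between true and predicted noise."""
    return sum((a - b) ** 2 for a, b in zip(eps, eps_pred)) / len(eps)

x0 = [random.gauss(0, 1) for _ in range(16)]  # a toy flattened "image"
xt, eps = forward_noise(x0, t=500)
loss = noise_loss(eps, [0.0] * len(eps))      # a trivial all-zero predictor
```

      <p>Inverting the same relation, x0 can be recovered exactly from x_t and the true noise, which is the identity the predicted clean image x̂0 in Section 2.2 relies on.</p>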
      <sec id="sec-3-1">
        <title>2.2. Content-Aware Perceptual Loss</title>
        <p>
          Standard diffusion training optimizes the prediction of the noise ε added at step t. To enhance
the semantic fidelity and structural coherence of generated characters, particularly respecting
the subtle morphological variations of radicals, we introduce a supplementary Content-Aware
Perceptual Loss ℒ_percep [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ].
        </p>
        <p>This loss operates on the predicted clean image x̂0 at an intermediate denoising step t, derived
from the current noisy state x_t and the predicted noise ε̂:
x̂0 = (x_t − √(1 − ᾱ_t) ε̂) / √ᾱ_t,
where ᾱ_t is the cumulative product of the variance schedule. We extract multi-scale features
φ_l(·) from different layers l of a pre-trained feature extractor (a VGG16 network) for both the
true clean image x0 and the predicted x̂0.</p>
        <p>Crucially, ℒ_percep weights the contribution of different feature levels based on the current
timestep t:</p>
        <p>ℒ_percep = (1/Z_t) ∑_l w_l(t) · ‖φ_l(x0) − φ_l(x̂0)‖²₂
Here, w_l(t) is a time-dependent weighting function. Low-level features (capturing edges,
textures) are emphasized during high-noise stages (large t), as they are crucial for establishing
the fundamental radical shapes and layout early in denoising. Conversely, high-level features
(capturing semantic structures) are emphasized during low-noise stages (small t), refining the
semantic coherence and fine details of the radicals and their integration as the image nears
completion. Z_t is a normalization factor accounting for the number of active feature elements
at each timestep. This dynamic weighting ensures the loss focuses on the most relevant visual
aspects at each denoising phase, significantly improving the preservation of radical morphology
and overall character integrity.</p>
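        <p>The time-dependent weighting can be sketched as follows. The linear ramp form of w_l(t) and the per-timestep normalization are assumed choices for illustration; the paper does not specify the exact schedule, only that low-level layers dominate at high noise and high-level layers at low noise.</p>

```python
def layer_weights(t, T, num_layers):
    """w_l(t): emphasize low-level layers (small l) at high noise (large t)
    and high-level layers (large l) at low noise. The linear ramp is an
    assumed form; the paper does not give the exact schedule."""
    s = t / T  # 1.0 near pure noise, 0.0 near the clean image
    raw = [s * (num_layers - 1 - l) + (1 - s) * l for l in range(num_layers)]
    z = sum(raw)
    return [r / z for r in raw]

def perceptual_loss(feats_x0, feats_x0_hat, t, T):
    """L_percep = sum_l w_l(t) * ||phi_l(x0) - phi_l(x0_hat)||^2, taking
    precomputed per-layer features phi_l as flat lists of floats."""
    ws = layer_weights(t, T, len(feats_x0))
    total = 0.0
    for w, f, g in zip(ws, feats_x0, feats_x0_hat):
        total += w * sum((a - b) ** 2 for a, b in zip(f, g))
    return total
```

        <p>At t = T the first (low-level) layer receives the largest weight, while at t = 0 the last (high-level) layer does, matching the behavior described above.</p>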
      </sec>
      <sec id="sec-3-2">
        <title>2.3. Overall Training Objective</title>
        <p>The complete loss function for training our radical-guided diffusion model combines the standard
DDPM noise prediction loss ℒ_noise and the Content-Aware Perceptual Loss ℒ_percep:
ℒ_total = ℒ_noise + λ ℒ_percep
where λ is a hyperparameter balancing the contribution of each loss term. The model is trained
end-to-end, learning to generate semantically coherent and structurally accurate oracle bone
characters conditioned on the specified radical set and counts, while simultaneously developing
rich internal representations of radical morphology and compositional rules.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Experiment</title>
      <p>In this section, we evaluate our radical-conditioned diffusion model for oracle character
generation. Conventional metrics like FID or Inception Score are unsuitable, since oracle characters
admit multiple valid forms for the same radical set. Instead, we adopt structure-aware
evaluation: multi-label classification (Section 3.3), case studies (Section 3.4), and semantic embedding
visualization (Figure 5) to assess whether generated characters contain the intended radicals
while maintaining structural diversity and positional patterns observed in real OBI samples.</p>
      <p>[Figure 3: (a) Distribution of radical counts per character (number of characters per radical-set
size: 1898, 1071, 919, 243); (b) frequency distribution of radical occurrence (185 radicals in total;
maximum 555 occurrences, minimum 1).]</p>
      <sec id="sec-4-1">
        <title>3.1. Dataset</title>
        <p>
          We construct our dataset based on the HWOBC [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] collection, which contains handwritten oracle
character images covering 3,881 distinct character categories. A large portion of these categories
have been structurally annotated with radical information in the Yinqi Wenyuan project. Using
these annotations, we associate 185 distinct radicals with 3,767 oracle character categories,
resulting in 80,823 annotated character images. Each annotated character is associated with one
or more radicals that reflect its structural components. The distribution of radical annotations
per category is visualized in Figure 1, and the distribution of radicals is shown in Figure 3.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Implementation details</title>
        <p>All experiments were performed on a platform with an NVIDIA GeForce RTX 4090 graphics
card. The deep learning framework used was PyTorch 2.6.1 with CUDA 12.4. The number of epochs and
the batch size were set to 400 and 64, respectively. All models were trained using the AdamW optimizer,
with an initial and final learning rate of 0.0001, momentum of 0.937, and weight decay of 0.0005.
The learning rate followed a warm-up cosine annealing schedule, gradually increasing within
the first 3 epochs. In the experiment, nearly all models triggered early stopping around epoch 400,
indicating that sufficient training was conducted on the whole dataset.</p>
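        <p>The warm-up cosine annealing schedule described above can be sketched as below. The base rate 0.0001, 3 warm-up epochs, and 400 total epochs come from the setup; the minimum learning rate floor is an assumption, since the paper does not state one.</p>

```python
import math

def lr_schedule(epoch, total_epochs=400, warmup_epochs=3,
                base_lr=1e-4, min_lr=1e-6):
    """Warm-up cosine annealing: linear ramp over the first warmup_epochs,
    then cosine decay from base_lr toward min_lr (min_lr is an assumed
    floor; the paper states only the 0.0001 rate and 3 warm-up epochs)."""
    if epoch < warmup_epochs:
        # Linear warm-up: reaches base_lr at the last warm-up epoch.
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

        <p>The schedule ramps to the base rate over three epochs and then decays smoothly, which stabilizes the early steps of diffusion training.</p>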
        <p>
          To evaluate the effectiveness of our radical-guided diffusion model and assess the utility
of the generated samples in downstream multi-label classification tasks, we construct a
controlled experimental setup based on the YinQiWenYuan [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] oracle bone character dataset. Each
character in the dataset is annotated with one or more radicals, and the task is formulated as
multi-label classification over these radicals.
        </p>
        <p>Due to the highly imbalanced nature of radical distributions and the limited number of
samples for certain rare radicals, it is crucial to ensure that all radicals are observed during
training while avoiding trivial memorization of character samples. To this end, we adopt a
coverage-based greedy selection strategy to construct the training set. Specifically, we iteratively
select samples that contribute the most previously unseen radicals until the entire radical set
is covered. This ensures that the model is exposed to all radical types during training while
leaving a subset of unseen character forms for evaluation.</p>
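        <p>The coverage-based greedy selection can be sketched as follows, with each character represented by its set of radical labels. The mini-corpus and label names are hypothetical; only the greedy rule (always pick the sample adding the most unseen radicals) comes from the text.</p>

```python
def coverage_greedy_split(samples):
    """Coverage-based greedy selection: repeatedly pick the sample that
    contributes the most previously unseen radicals, until every radical
    in the corpus is covered. Returns the indices of the selected seed set."""
    all_radicals = set()
    for rads in samples:
        all_radicals |= rads
    covered, selected = set(), []
    while covered != all_radicals:
        # Among unselected samples, take the one with the largest gain.
        best = max((i for i in range(len(samples)) if i not in selected),
                   key=lambda i: len(samples[i] - covered))
        selected.append(best)
        covered |= samples[best]
    return selected

# Hypothetical mini-corpus: each character is its set of radical labels.
chars = [{"a", "b"}, {"b"}, {"c", "d", "e"}, {"e", "f"}, {"a", "f"}]
seed = coverage_greedy_split(chars)
```

        <p>The loop always terminates: while any radical is uncovered, some unselected sample must contain it, so every iteration can make progress.</p>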
        <p>The remaining samples are filtered to construct a test set such that each test sample contains
only radicals already present in the training set. This constraint ensures that the evaluation
focuses on compositional generalization rather than extrapolation to unseen radical types.
Finally, we randomly sample 10% of the total dataset to form the test set, with the remaining
90% used for training. Both the radical-guided diffusion model and the baseline multi-label
classifier are trained on this split.</p>
      </sec>
      <sec id="sec-4-3">
        <title>3.3. Performance analysis</title>
        <p>To evaluate diffusion-based augmentation, we generate 8,000 synthetic oracle bone character
images for each radical set size (1–5) using our radical-conditioned diffusion model. The
synthetic data are combined with the original training set to retrain the multi-label classifier.</p>
        <p>Importantly, the classification model used to evaluate augmentation effects is trained from
scratch under two conditions: with and without the inclusion of diffusion-generated data. This
enables a direct comparison of the impact of synthetic data on the classifier’s performance
across multiple metrics.</p>
        <p>Table 1 summarizes the performance of the classifier before and after introducing
diffusion-generated samples into the training set. Overall, the inclusion of generated samples leads to
consistent improvements across all evaluation metrics. Notably, the Average Precision improves
from 0.671 to 0.702, and the Subset Accuracy increases from 0.286 to 0.307. The Hamming Loss
is also slightly reduced, indicating better precision in multi-label predictions.</p>
        <p>These results demonstrate that the diffusion model not only generates plausible character
forms conditioned on radical sets but also introduces useful variance into the training data
that benefits generalization. The improvement in recall and F1 score further suggests that the
model becomes more capable of identifying rare or co-occurring radicals after exposure to the
synthetic examples. This validates our hypothesis that the morphological diversity captured by
the diffusion model can enhance downstream classification performance.</p>
        <p>3.4. Case study</p>
        <p>[Figure 4: Qualitative case study. For easy and complex instances, each column shows the input
radical(s), generation samples, and real samples.]</p>
        <p>To further evaluate the fidelity and diversity of generated samples, we conduct a qualitative
case study comparing generated characters with real ones sharing the same radical components,
as shown in Figure 4. For each example, we provide the input radical(s), multiple generated
results, and real oracle bone characters containing those radicals.</p>
        <p>In the Easy Instance setting, radicals are typically standalone or structurally dominant.
The generated characters preserve visual similarity to real samples while capturing stylistic
nuances—stroke thickness, spatial balance, and subtle morphological variations typical of
handwritten oracle scripts.</p>
        <p>In the Complex Instance setting, inputs include multiple radicals with diverse layouts and
structural entanglement. The model produces plausible characters that retain radical identity and
arrangement, often applying adaptive transformations—stretching, rotation, or compression—to
emulate real OBI spatial strategies. These results demonstrate the model’s ability to internalize
compositional flexibility and radical-level variation.</p>
        <p>Overall, the visual results support our claim that the radical-conditioned diffusion model
effectively captures both the identity and adaptive behavior of radicals in context, enabling
realistic and semantically meaningful character generation.</p>
        <p>To investigate how the model organizes radical information, as shown in Figure 5, we
extract the embedding vectors e(k,j) and visualize them with t-distributed Stochastic Neighbor</p>
        <p>
          Embedding (t-SNE) [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. We find that radicals with similar morphological structures cluster
closely in the embedding space, suggesting the model captures structural similarity. However,
this can cause confusion during generation, as the model may struggle to distinguish between
similar radicals, especially those sharing stroke patterns or symmetry. In contrast, radicals
with larger shape differences are easier to cluster and preserve stably during generation, with
less distortion or substitution. This implies a trade-off between embedding expressiveness and
discriminability, which may be improved via contrastive regularization or additional supervision.
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Conclusion</title>
      <p>In this work, we propose a radical-conditioned diffusion model for oracle bone character
generation, which effectively captures the morphological variations and combinatorial patterns
of radicals. By incorporating embeddings that encode both radical identity and positional
information, the model preserves subtle visual features of radicals across different character
contexts. Our experiments demonstrate that such a generative approach facilitates structural
understanding of oracle bone script and models the compositional relationships among specific
radicals. Future work may integrate structural priors or contrastive learning to further enhance
radical disambiguation, and extend the task to generate corresponding glyphs from different
historical stages using radical-based priors derived from oracle bone inscriptions.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This research was supported by the Natural Science Foundation of China (Grant No. 62506007),
the Natural Science Foundation of Henan Province (Grant No. 242300420680), the Paleography
and Chinese Civilization Inheritance and Development Program (Grant Nos. G1807, G1806,
G2821), the Henan Province Science and Technology Research Project (Grant Nos. 242102210116,
252102321071), Major Science and Technology Project of Anyang (Grant No. 2025A02SF007) and
the Henan Province High-Level Talents International Training Program (Grant No. GCC2025028).</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Jiao</surname>
          </string-name>
          , et al.,
          <article-title>Oracle bone inscriptions components analysis based on image similarity</article-title>
          ,
          <source>in: 2020 IEEE 9th joint international information technology and artificial intelligence conference (ITAIC)</source>
          , volume
          <volume>9</volume>
          , IEEE,
          <year>2020</year>
          , pp.
          <fpage>1666</fpage>
          -
          <lpage>1670</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>On the improvement of the oracle radical system</article-title>
          ,
          <source>Lexicographical Studies</source>
          (
          <year>2013</year>
          )
          <fpage>27</fpage>
          -
          <lpage>33</lpage>
          . doi:10.16134/j.cnki.cn31-1997/g2.2013.05.004.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>On the radicals in oracle bone inscriptions: The earliest set of pictographs in china</article-title>
          ,
          <source>Journal of Ancient Books Collation and Studies</source>
          (
          <year>2002</year>
          )
          <fpage>32</fpage>
          -
          <lpage>35</lpage>
          . In Chinese.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>A Study on the Evolution of the Radical System in Oracle Bone Inscriptions, Master's thesis</article-title>
          , Zhengzhou University,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>HWOBC: A handwriting oracle bone character recognition database</article-title>
          , in:
          <source>Journal of Physics: Conference Series</source>
          , volume
          <volume>1651</volume>
          , IOP Publishing,
          <year>2020</year>
          , p.
          <fpage>012050</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <article-title>Yinqi Wenyuan: Oracle bone inscriptions information platform</article-title>
          , https://jgw.aynu.edu.cn/,
          <year>2025</year>
          . Maintained by Anyang Normal University in collaboration with the Chinese Academy of Social Sciences, updated in 2025.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <article-title>OraclePoints: A hybrid neural representation for oracle character</article-title>
          ,
          <source>in: Proceedings of the 31st ACM International Conference on Multimedia, MM '23</source>
          ,
          Association for Computing Machinery, New York, NY, USA,
          <year>2023</year>
          , p.
          <fpage>7901</fpage>
          -
          <lpage>7911</lpage>
          . URL: https://doi.org/10.1145/3581783.3612534. doi:10.1145/3581783.3612534.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.-F.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>Mix-up augmentation for oracle character recognition with imbalanced data distribution</article-title>
          , in:
          <source>Document Analysis and Recognition - ICDAR 2021: 16th International Conference, Lausanne, Switzerland, September 5-10, 2021, Proceedings, Part I</source>
          , Springer-Verlag, Berlin, Heidelberg,
          <year>2021</year>
          , p.
          <fpage>237</fpage>
          -
          <lpage>251</lpage>
          . URL: https://doi.org/10.1007/978-3-030-86549-8_16. doi:10.1007/978-3-030-86549-8_16.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-L.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Unsupervised structure-texture separation network for oracle character recognition</article-title>
          ,
          <source>IEEE Transactions on Image Processing</source>
          <volume>31</volume>
          (
          <year>2022</year>
          )
          <fpage>3137</fpage>
          -
          <lpage>3150</lpage>
          . doi:10.1109/TIP.2022.3165989.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>F.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-g.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <article-title>Image translation for oracle bone character interpretation</article-title>
          ,
          <source>Symmetry</source>
          <volume>14</volume>
          (
          <year>2022</year>
          )
          <fpage>743</fpage>
          . URL: https://api.semanticscholar.org/CorpusID:247959908.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Jiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <article-title>Study on the evolution of chinese characters based on few-shot learning: From oracle bone inscriptions to regular script</article-title>
          ,
          <source>PLoS ONE</source>
          <volume>17</volume>
          (
          <year>2022</year>
          )
          <fpage>e0272974</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Guan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Kuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Bai</surname>
          </string-name>
          , et al.,
          <article-title>An open dataset for oracle bone character recognition and decipherment</article-title>
          ,
          <source>Scientific Data</source>
          <volume>11</volume>
          (
          <year>2024</year>
          )
          <fpage>976</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Thickstun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gulrajani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Hashimoto</surname>
          </string-name>
          ,
          <article-title>Diffusion-LM improves controllable text generation</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>4328</fpage>
          -
          <lpage>4343</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Darrell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Rambhatla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girdhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Misra</surname>
          </string-name>
          ,
          <article-title>InstanceDiffusion: Instance-level control for image generation</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>6232</fpage>
          -
          <lpage>6242</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>R.</given-names>
            <surname>Rombach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Blattmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lorenz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Esser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ommer</surname>
          </string-name>
          ,
          <article-title>High-resolution image synthesis with latent diffusion models</article-title>
          ,
          <source>in: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2022</year>
          . URL: http://dx.doi.org/10.1109/cvpr52688.2022.01042. doi:10.1109/cvpr52688.2022.01042.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nichol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>McGrew</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Sutskever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models</article-title>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>C.</given-names>
            <surname>Saharia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Saxena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Whang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. L.</given-names>
            <surname>Denton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ghasemipour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. Gontijo</given-names>
            <surname>Lopes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. Karagol</given-names>
            <surname>Ayan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Salimans</surname>
          </string-name>
          , et al.,
          <article-title>Photorealistic text-to-image diffusion models with deep language understanding</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>36479</fpage>
          -
          <lpage>36494</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nichol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Hierarchical text-conditional image generation with clip latents</article-title>
          ,
          <source>arXiv preprint arXiv:2204.06125</source>
          <volume>1</volume>
          (
          <year>2022</year>
          )
          <fpage>3</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Abbeel</surname>
          </string-name>
          ,
          <article-title>Denoising diffusion probabilistic models</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>6840</fpage>
          -
          <lpage>6851</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <article-title>FontDiffuser: One-shot font generation via denoising diffusion with multi-scale content aggregation and style contrastive learning</article-title>
          ,
          <source>in: Proceedings of the AAAI conference on artificial intelligence</source>
          , volume
          <volume>38</volume>
          ,
          <year>2024</year>
          , pp.
          <fpage>6603</fpage>
          -
          <lpage>6611</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>L.</given-names>
            <surname>van der Maaten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <article-title>Visualizing data using t-SNE</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>9</volume>
          (
          <year>2008</year>
          )
          <fpage>2579</fpage>
          -
          <lpage>2605</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>