<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Enhancing Small Open-Source Language Models for Ontology Generation through Metric-Guided Continual Pretraining</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Miquel Canal-Esteve</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Research Group of Language Processing and Information System, University of Alicante</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Ontology development requires expert knowledge and structural precision. While Large Language Models (LLMs) show promise for ontology tasks, small open-source models like Llama 3.2-1B still lack strong semantic and structural understanding. We propose a two-phase approach: continual pretraining on high-quality ontology datasets, guided by two frameworks: one for semantic metrics and another for lexical-structural metrics. We pretrained Llama 3.2-1B using semantic-based high-quality subsets and evaluated improvements through manual and structural analyses. Results show small, high-quality subsets yield rapid gains, while larger, diverse datasets improve long-term performance. Since semantic metrics need complete ontologies, ORI remains key for evaluating fragments. Future work will apply instruction tuning and fine-tuning for specialized tasks such as those in the LLMs4OL benchmark or for generating structured resources across domains using ontology-based methods. This work shows that thoughtful data selection and continual pretraining can push small LLMs toward expert-level ontology generation.</p>
      </abstract>
      <kwd-group>
        <kwd>Ontology generation</kwd>
        <kwd>continual pretraining</kwd>
        <kwd>semantic metrics</kwd>
        <kwd>large language models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>2. Related Work</title>
      <p>
        Ontology-to-ontology generation remains a largely untapped area, especially when considering the
capabilities of large language models (LLMs). While most prior research focuses on applying LLMs
to tasks like ontology refinement, enrichment, or generation from unstructured text sources [
        <xref ref-type="bibr" rid="ref11 ref6 ref7">6, 7, 11</xref>
        ],
these works primarily provide the groundwork upon which this study builds.
      </p>
      <p>
        Recent research highlights the diverse roles LLMs can play in ontology engineering. For instance,
Zhao et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] incorporate OntoClean principles to improve refinement; Toro et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] leverage
retrieval-augmented generation (RAG) in DRAGON-AI for dynamic construction; Fathallah et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]
address structured translation using NeOn-GPT; Zhang et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] develop conversational approaches
with OntoChat; He et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] apply deep learning for ontology completion in DeepOnto; and Mukanova
et al. [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] use LLMs for enrichment tasks. Yet, none of these approaches directly tackle the challenge of
autonomously generating new ontologies from partial or incomplete inputs.
      </p>
      <p>
        Additional research on text-to-ontology generation also informs our approach. Babaei et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] break
down the generation process into subtasks and emphasize the importance of fine-tuning; Saeedizade et
al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] guide progressive ontology construction through competency questions; and Da Silva et al. [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]
demonstrate how few-shot prompting can enhance generation outcomes. By contrast, our work adopts
continual pretraining to help the model internalize domain-relevant patterns and knowledge prior to
task-specific fine-tuning, thus reducing dependence on prompt-driven examples.
      </p>
      <p>
        Moreover, the importance of data cleaning for improving model performance is well established
across the literature [
        <xref ref-type="bibr" rid="ref19 ref20">19, 20</xref>
        ], with several studies introducing sophisticated filtering methods during
preprocessing [
        <xref ref-type="bibr" rid="ref21 ref22 ref23">21, 22, 23</xref>
        ]. However, these methods are predominantly designed for unstructured text,
and to date, no standardized methodology exists for selecting high-quality data specifically for continual
pretraining in the context of ontologies. Our study addresses this gap by proposing a systematic
approach to identify and leverage high-quality ontological data for improving LLM performance.
      </p>
    </sec>
    <sec id="sec-2">
      <title>3. Description of the Proposed Research</title>
      <p>This research is structured in two main phases. First, we focus on improving the models’ general
semantic and ontology knowledge through continual pretraining on ontology datasets. For this, we have
developed two dedicated repositories to measure the quality of datasets: one for computing semantic
metrics 1 (covering classes, taxonomic and non-taxonomic relations) and another for computing lexical
and structural metrics 2 (focused on vocabulary usage and structural patterns). The goal is to combine
these into a global metric that can evaluate the quality of both the datasets and the outputs generated
by the model, helping to identify high-quality subsets, since it is well known that less data of higher
quality is more effective for training. Some experiments have already been carried out in this direction.</p>
      <p>
        Second, we will apply instruction-tuning and fine-tuning to adapt the continually pretrained models
for specific, high-value ontology tasks, such as those defined in the LLMs4OL [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] benchmark, including
term typing, type taxonomy discovery, and type non-taxonomic relation extraction. Together, these
efforts aim to systematically enhance small, open-source LLMs’ ability to handle advanced ontology
engineering challenges and structured knowledge tasks more effectively than base models. These tasks
can help automate the creation of didactic material based on ontologies.
      </p>
    </sec>
    <sec id="sec-3">
      <title>4. Methodology</title>
      <p>This section details the methodological framework designed to improve small open-source language
models for ontology generation. We combine curated ontology repositories, carefully designed semantic
and lexical-structural metrics, and a continual pretraining strategy. By integrating dataset selection,
metric-driven evaluation, and robust training configurations, our approach systematically enhances
the model’s semantic, lexical, and structural capabilities. Below, we describe the ontology sources, the
metric systems, the segmentation strategies, the evaluation framework, and the pretraining setup.</p>
      <p>1: https://github.com/miquelcanalesteve/LLM4Onto 2: https://github.com/miquelcanalesteve/ontology-metrics-pretraining</p>
      <sec id="sec-3-1">
        <sec id="sec-3-1-1">
          <title>4.1. Ontology Repository</title>
          <p>We base our methodology on an ontology repository that provides structured knowledge for pretraining.
Such repositories include ontologies of varying size, completeness, and semantic richness, requiring
additional filtering before use.</p>
          <p>
            For this study, we selected DBpedia Archivo3, a widely used repository of ontologies across diverse
domains [
            <xref ref-type="bibr" rid="ref24">24</xref>
            ]. Its files are provided in TTL (Turtle) format, a popular and human-readable RDF serialization,
making it ideal for evaluating how quality-based dataset selection affects model performance.
          </p>
          <p>Our dataset, downloaded on July 15, 2024, includes 1,766 ontologies totaling 71 million triples, ranging
from small sets under 10 triples to large ontologies exceeding 10 million.</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>4.2. Lexical and Structural Ontology Metrics</title>
          <p>This section introduces the text-based metrics designed to quantify vocabulary usage, lexical richness,
and structural variability in ontology files.</p>
          <p>
            To evaluate an ontology, we assess its raw text representation—whether it comes from an existing
dataset or is generated by a model—using lightweight text-based metrics inspired by Palomar et al. [
            <xref ref-type="bibr" rid="ref19">19</xref>
            ].
These metrics capture vocabulary use and structural diversity without requiring reasoning or formal
parsing. We then aggregate them into the Ontology Reference Index (ORI), drawing on concepts from
[
            <xref ref-type="bibr" rid="ref25">25</xref>
            ], to support data ranking and performance evaluation.
          </p>
          <p>Vocabulary-specific density (den). Average number of predefined vocabulary terms per non-empty line
(dependent on the typical one-relation-per-line format):</p>
          <p>den = (1/N) Σ_{i=1..N} v_i</p>
          <p>where N is the number of non-empty lines, and v_i is the number of vocabulary terms detected in line
i. The vocabulary is a predefined set of ontology modelling terms commonly used across structured
knowledge representations, including those from RDF, RDFS, OWL, and XSD (the full vocabulary is
available in the repository). Terms inside quoted literals are excluded.</p>
          <p>Vocabulary-specific diversity (div). Proportion of vocabulary terms that appear at least once in the file:</p>
          <p>div = |V_doc| / |V_spec|</p>
          <p>where V_doc ⊆ V_spec is the set of vocabulary terms found in the file, and V_spec is the same vocabulary
used for den. A higher value indicates broader use of available modeling constructs.</p>
          <p>Logical block uniqueness ratio (LBUR). Fraction of unique logical blocks in the ontology:</p>
          <p>LBUR = |unique_blocks| / |blocks|</p>
          <p>Logical blocks are defined as minimal self-contained RDF/OWL units, starting from a subject and
continuing until the terminating period. These typically include class declarations, property assertions,
or grouped triples.</p>
          <p>Line uniqueness ratio (LUR). Fraction of unique non-empty lines:</p>
          <p>LUR = |unique_nonempty_lines| / |nonempty_lines|</p>
          <p>This metric captures surface-level textual redundancy, regardless of line type (structural, directive, or
annotation).</p>
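As an illustration, the four text-based ratios above can be computed with a short script. The vocabulary set below is a small illustrative subset, not the full RDF/RDFS/OWL/XSD list from the repository, and the block-splitting heuristic is a simplification of the subject-to-period definition:

```python
import re

# Illustrative subset of the predefined vocabulary (the full set,
# covering RDF/RDFS/OWL/XSD terms, lives in the paper's repository).
VOCAB = {"rdf:type", "rdfs:subClassOf", "rdfs:label", "owl:Class",
         "owl:ObjectProperty", "rdfs:domain", "rdfs:range"}

def strip_literals(line: str) -> str:
    """Blank out quoted literals so terms inside them are not counted."""
    return re.sub(r'"[^"]*"', '""', line)

def lexical_metrics(ttl_text: str) -> dict:
    lines = [l for l in ttl_text.splitlines() if l.strip()]
    n = len(lines)
    # den: average vocabulary terms per non-empty line
    hits = [sum(strip_literals(l).count(t) for t in VOCAB) for l in lines]
    den = sum(hits) / n if n else 0.0
    # div: fraction of the vocabulary that appears at least once
    seen = {t for t in VOCAB if any(t in strip_literals(l) for l in lines)}
    div = len(seen) / len(VOCAB)
    # LUR: fraction of unique non-empty lines
    lur = len(set(lines)) / n if n else 0.0
    # LBUR: blocks approximated as runs ending in a terminating period
    blocks = [b.strip() for b in re.split(r'\.\s*\n', ttl_text) if b.strip()]
    lbur = len(set(blocks)) / len(blocks) if blocks else 0.0
    return {"den": den, "div": div, "LUR": lur, "LBUR": lbur}
```

In practice the full vocabulary list and the exact block delimiter rules from the repository would replace these simplified placeholders.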
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>Brunet Index (BI). Lexical richness index:</title>
        <p>BI = N^(V^(−0.165))</p>
        <p>where N is the total number of word tokens and V is the number of unique word types. Composite
terms (e.g., prefix-based identifiers) are tokenized accordingly. Lower values indicate greater lexical
diversity.</p>
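A minimal sketch of this index, under a simplified tokenization (splitting on non-alphanumeric characters, which is an assumption about how composite terms are handled):

```python
import re

def brunet_index(text: str) -> float:
    """Brunet's index W = N ** (V ** -0.165); lower means more diverse.

    Composite terms (e.g., prefix:local identifiers) are split on
    non-alphanumeric characters, a simplification of the paper's
    tokenization rule.
    """
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    n_tokens = len(tokens)       # N: total word tokens
    n_types = len(set(tokens))   # V: unique word types
    if n_tokens == 0:
        return 0.0
    return n_tokens ** (n_types ** -0.165)
```

For equal token counts, a text with more repeated types scores higher (worse) than a fully diverse one.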
        <p>Ontology Reference Index (ORI) and Evaluation The Ontology Reference Index (ORI) provides a
weighted measure of an ontology’s alignment with an idealized reference, which aggregates the best
observed values for each of the five previously defined metrics. This reference does not represent any
single ontology but instead reflects the per-metric maxima identified across the dataset.</p>
        <p>The computation normalizes all metric values using min-max scaling. Because lower Brunet Index
values indicate better lexical diversity, the method inverts this metric using 1 − norm(BI), where norm(·)
denotes the min-max normalized value. The ORI score is then calculated as:</p>
        <p>ORI = Σ_{m ∈ M} w(m) · f(m), where M = {den, div, LBUR, LUR, BI}, and</p>
        <p>f(m) = norm(m) if m ≠ BI; f(m) = 1 − norm(m) if m = BI</p>
        <p>The weight w(m) assigned to each metric reflects the performance gap between the base model (Llama
3.2-1B) and the top-performing ontology for that metric. For the Brunet Index, the method computes
this ratio inversely (base / best) to maintain consistency with its inverted interpretation. The procedure
then normalises these gains to derive the final weights, which appear in Table 1.</p>
        <p>Table 1: Per-metric values for the base model and the top-ranked dataset, the resulting gains, and the normalized weights.
den div LBUR LUR BI
Llama 3.2-1B 0.500 0.035 0.955 0.738 16.11
Top-1 dataset 1.257 0.622 1 1 4.382
Gain 2.514 17.744 1.048 1.345 3.679
Weights 0.096 0.673 0.040 0.051 0.140</p>
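Putting the pieces together, the ORI can be sketched as a weighted sum of min-max normalized metrics with the Brunet Index inverted. The weights are those of Table 1; the min-max bounds would normally come from the whole dataset and are invented here for illustration:

```python
# Weights from Table 1 (normalized per-metric gains).
WEIGHTS = {"den": 0.096, "div": 0.673, "LBUR": 0.040, "LUR": 0.051,
           "BI": 0.140}

def minmax(value, lo, hi):
    """Min-max scale a value into [0, 1]."""
    if hi == lo:
        return 0.0
    return (value - lo) / (hi - lo)

def ori(metrics, bounds):
    """Weighted sum of normalized metrics; BI is inverted (1 - norm)
    because a lower Brunet Index means better lexical diversity."""
    score = 0.0
    for name, w in WEIGHTS.items():
        norm = minmax(metrics[name], *bounds[name])
        if name == "BI":
            norm = 1.0 - norm
        score += w * norm
    return score
```

An ontology that attains the best observed value on every metric scores 1.0, matching the idealized reference described above.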
        <p>To estimate base model values, we sampled 12 ontology fragments of 150 tokens each and
generated 6 completions of 450 tokens per fragment. The generation used the following configuration:
do_sample=True, top_k=50, top_p=0.95, and temperature=0.7. The trained models followed
the same ontology completion protocol.</p>
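The completion protocol can be sketched as follows. The sampling parameters are those stated above; the wrapper function and its name are illustrative, and running it requires the Hugging Face transformers library and access to the model checkpoint:

```python
# Sampling configuration stated in the paper; the helper below is an
# illustrative wrapper, not the authors' actual script.
GEN_KWARGS = dict(do_sample=True, top_k=50, top_p=0.95,
                  temperature=0.7, max_new_tokens=450)

def complete_fragment(model, tokenizer, fragment, n_completions=6):
    """Sample n_completions 450-token continuations of a 150-token
    ontology fragment, per the evaluation protocol."""
    inputs = tokenizer(fragment, return_tensors="pt")
    completions = []
    for _ in range(n_completions):
        out = model.generate(**inputs, **GEN_KWARGS)
        new_tokens = out[0][inputs["input_ids"].shape[1]:]
        completions.append(
            tokenizer.decode(new_tokens, skip_special_tokens=True))
    return completions
```

With 12 fragments and 6 completions each, this yields the 72 samples used to estimate the base-model metric values.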
        <sec id="sec-3-2-1">
          <title>4.3. Semantic Metrics</title>
          <p>
            To evaluate the structural quality of an ontology, we propose lightweight complexity-based metrics
inspired by Tello et al. [
            <xref ref-type="bibr" rid="ref26">26</xref>
            ] and Gutiérrez et al. [
            <xref ref-type="bibr" rid="ref27">27</xref>
            ]. These metrics quantify the richness and density
of the ontology without requiring reasoning or formal entailment, making them scalable for large
repositories. We then aggregate them into a unified quality score to support dataset filtering and model
evaluation.
          </p>
          <p>Average Subclasses per Class (SC). Average number of subclasses per class, reflecting the
hierarchical depth and granularity of the ontology taxonomy:</p>
          <p>SC = (Σ_{i=1..C} sub(c_i)) / C</p>
          <p>where C is the number of classes and sub(c_i) is the number of subclasses of class c_i.</p>
          <p>Average Non-Taxonomic Relations per Class (NTR). Average number of non-taxonomic
relationships per class, indicating the density of semantic links beyond simple hierarchies:</p>
          <p>NTR = (Σ_{i=1..C} not(c_i)) / C</p>
          <p>where not(c_i) is the number of non-taxonomic relationships attached to class c_i.</p>
          <p>Property Density (PD). Average number of attributes and non-taxonomic relations per class, serving
as a proxy for schema richness and information density:</p>
          <p>PD = (Σ_{i=1..C} (att(c_i) + not(c_i))) / C</p>
          <p>where att(c_i) denotes the number of data properties (attributes) of class c_i.</p>
          <p>To consolidate these aspects into a unified quality score, we normalize the three metrics using
min-max scaling and compute:</p>
          <p>score = norm(SC) + norm(NTR) + norm(PD)</p>
          <p>These metrics are computed using the rdflib Python library, providing an efficient and reproducible
basis for ontology quality analysis.</p>
          <p>4.3.1. Segmentation of Datasets</p>
          <p>While the segmentation approach described here is applied using the semantic quality score, the same
logic could be extended to the Ontology Reference Index (ORI) or other metrics, allowing future work to
explore dataset splits that prioritize lexical and structural quality alongside semantic complexity. For this
study, however, segmentation is based solely on the semantic quality score, which focuses on semantic
richness, density, and hierarchy.</p>
          <p>To segment the dataset, we first compute the token count for each ontology. This allows us to
define partitions based on token distribution, ensuring that different subsets capture varying levels of
quality and diversity (i.e., more ontologies lead to greater diversity). While segmentation can be done
in multiple ways—by quartiles, deciles, halves, or other thresholds—we adopt three specific strategies:
1. Q1 (Prioritizing Quality): Ontologies are ranked by the semantic quality score, and those with the highest scores are
selected until reaching at least 25% of the total tokens. Since selection is done without truncation,
the last ontology added may slightly exceed this threshold. In our case, this resulted in 31% of the
total tokens.
2. Q1,2 (Quality + Diversity): Ontologies are again ranked by the semantic quality score, and selection continues until
reaching at least 50% of the total tokens. This strategy balances quality and diversity while
ensuring that no ontology is arbitrarily truncated.
3. Q1-4 (Full Dataset): This set includes all available ontologies, covering the entire range of
quality levels and structural complexities. It serves as a baseline to assess the impact of training
on the full, unfiltered dataset.</p>
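The three strategies share one selection loop, sketched below. Ontologies are treated as whole-file units, so the last one added may overshoot the token budget, as noted for Q1:

```python
# Sketch of the Q1 / Q1,2 segmentation: rank by quality score, then
# take whole ontologies until a token budget (25% or 50% of the
# corpus) is reached, with no truncation.
def select_subset(ontologies, budget_fraction):
    """ontologies: list of (name, score, n_tokens) tuples.
    Returns the chosen names and the fraction of tokens they cover."""
    total = sum(t for _, _, t in ontologies)
    ranked = sorted(ontologies, key=lambda o: o[1], reverse=True)
    chosen, used = [], 0
    for name, score, n_tokens in ranked:
        if used >= budget_fraction * total:
            break
        chosen.append(name)   # last item may overshoot the budget
        used += n_tokens
    return chosen, used / total
```

With a 25% budget this can legitimately return more than 25% of the tokens (31% in the paper's Q1 split), since no ontology is cut mid-file.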
          <p>This segmentation enables a systematic assessment of how training on subsets with varying semantic
quality affects model performance. Table 3 summarizes the selected datasets, showing the average
values and standard deviations for each key quality metric.</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>4.4. Manual Evaluation</title>
          <p>
            The manual evaluation framework is based on da Silva et al. [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ], which categorizes errors into syntactic,
semantic, and structural issues to comprehensively assess ontology quality. Additional criteria follow
Chen et al. [
            <xref ref-type="bibr" rid="ref28">28</xref>
            ] to address ambiguity and redundancy, and Xu et al. [
            <xref ref-type="bibr" rid="ref29">29</xref>
            ] to capture text repetition.
Errors include syntactic violations (e.g., missing delimiters), triplet repetition, text repetition within
comments or literals, semantic redundancy, ambiguity between entities, semantic contradictions (e.g.,
conflicting OWL types), and vocabulary misuse involving incorrect ontology terms. A complete guide
for the evaluation is available in the repository 4.
          </p>
          <p>
            To quantify performance, we compute the mean error rate per triple across categories, following
da Silva et al. [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ]. The evaluation uses unseen ontology fragments drawn from diverse repositories such
as AGRO5, EDAM6, MDS7, and SWEET8, covering domains like biology, spatial data, and agriculture.
Each fragment (150 tokens) was randomly sampled and generated six times using the Hugging Face
library with do_sample=True, top_k=50, top_p=0.95, and temperature=0.7, ensuring robust
and unbiased measurement of generalization capabilities.
          </p>
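The aggregation step can be sketched as follows; the category keys are paraphrased from the error list above, and the function is illustrative rather than the authors' exact script:

```python
# Illustrative aggregation of the manual evaluation: mean error rate
# per triple, averaged across the error categories (da Silva-style).
ERROR_CATEGORIES = ("syntactic", "triplet_repetition", "text_repetition",
                    "semantic_redundancy", "ambiguity", "contradiction",
                    "vocabulary_misuse")

def mean_error_rate(error_counts: dict, n_triples: int) -> float:
    """Average, over categories, of (errors in category) / (triples)."""
    rates = [error_counts.get(c, 0) / n_triples
             for c in ERROR_CATEGORIES]
    return sum(rates) / len(rates)
```

A fragment set with 100 generated triples and 7 syntactic errors, for example, contributes 0.07 in the syntactic category before averaging over the seven categories.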
        </sec>
        <sec id="sec-3-2-3">
          <title>4.5. Pretraining LLM</title>
          <p>
            For continual pretraining, we used the Llama 3.2-1B model [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ], chosen for its compact yet expressive
1.2 billion parameter architecture, which balances adaptability and computational efficiency. The TTL
ontologies were processed as plain text, allowing standard NLP tokenization without specialized parsing,
resulting in 1.25 billion tokens across all subsets. To scale training effectively, we applied a Distributed
Data Parallel (DDP) strategy [
            <xref ref-type="bibr" rid="ref30">30</xref>
            ], distributing parameters and gradients across four NVIDIA A100
GPUs (40 GB each), with gradient accumulation and checkpointing to optimize batch size. Each dataset
subset was pretrained for two epochs, as longer runs showed no additional gains.
          </p>
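A hypothetical configuration mirroring this setup is sketched below; the field names and the batch/accumulation values are assumptions for illustration, not figures reported by the paper:

```python
# Hypothetical DDP continual-pretraining configuration: 4 GPUs,
# gradient accumulation and checkpointing, two epochs per subset.
TRAIN_CONFIG = {
    "model_name": "meta-llama/Llama-3.2-1B",
    "num_train_epochs": 2,              # longer runs showed no gains
    "gradient_checkpointing": True,
    "gradient_accumulation_steps": 8,   # assumed value
    "per_device_train_batch_size": 4,   # assumed value
    "nproc_per_node": 4,                # 4 x NVIDIA A100 (40 GB)
}
# Launch sketch (shell): torchrun --nproc_per_node=4 pretrain.py
```

Gradient accumulation multiplies the effective batch size (here, 4 GPUs x 4 sequences x 8 steps = 128 sequences per optimizer step under these assumed values) without exceeding per-GPU memory.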
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Experiments</title>
      <p>The results reported in Table 2 use quartiles selected based on semantic quality metrics, which guided
the segmentation of the dataset into high-quality (Q1), top-half (Q1,2), and full (Q1...4) subsets. While
the same segmentation logic could, in principle, be applied using the Ontology Reference Index (ORI) to
emphasize lexical and structural quality, in this study it was only tested with the semantic quality metric, which
focuses on semantic richness, density, and hierarchy.</p>
      <p>Importantly, the semantic metrics cannot be applied to model-generated outputs because these are only partial ontology
fragments, and the rdflib library requires complete, parsable ontology structures to compute these
metrics, whereas ORI remains applicable because it operates directly over the raw text representation.</p>
      <sec id="sec-4-1">
        <p>4: https://github.com/miquelcanalesteve/LLM4Onto/tree/main/results
5: https://bioportal.bioontology.org/ontologies/AGRO
6: https://bioportal.bioontology.org/ontologies/EDAM
7: https://matportal.org/ontologies/MDS
8: https://earthportal.eu/ontologies/SWEET</p>
        <p>Table 2: Error rates for the base Llama 3.2-1B model and for models pretrained on the Q1, Q1,2, and
Q1...4 subsets after one and two epochs.</p>
        <p>The base Llama 3.2-1B model shows a high total error rate of 6.6%, mainly driven by repetition errors
(30.7%) and syntactic issues (3.0%). Semantic and vocabulary-specific errors are almost negligible in
the base outputs, reflecting structurally shallow generations. Pretraining on the high-quality subset
(Q1) for one epoch sharply reduces the total error rate to 1.6%, with substantial improvements in
repetition (4.4%) and syntactic errors (0.6%). A second epoch on Q1 slightly increases total errors to
2.4%, suggesting diminishing returns or mild overfitting.</p>
        <p>Expanding the training to larger subsets, such as Q1,2 or the full dataset (Q1...4), stabilizes error
rates between 2.4% and 2.5%, with the best redundancy and text repetition reduction achieved under
the Q1,2 (2 epochs) and Q1...4 (1 epoch) settings. These configurations show that simply increasing
dataset size or epochs does not linearly improve performance, making it crucial to calibrate training
parameters carefully.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Conclusions and Future Work</title>
      <p>Overall, the results demonstrate that continual pretraining meaningfully boosts the model’s ability to
generate coherent, semantically aligned, and structurally rich ontologies. While small, high-quality
subsets like Q1 enable rapid improvements, broader datasets like Q1...4 maximize long-term structural
gains, provided training configurations are carefully balanced to avoid performance plateaus. These
findings highlight the need to rethink data selection strategies: although this study segmented data
using semantic metrics, future work should explore integrating lexical and structural dimensions into a
combined metric. Such a mixed metric could help isolate subsets that offer the best balance between
semantic depth, lexical richness, and structural complexity, potentially driving even more robust model
improvements.</p>
      <p>
        Additionally, the evaluation framework itself presents an opportunity for advancement. The current
manual assessment, while informative, is labor-intensive and limits scalability; developing an automated
evaluation pipeline would not only streamline the process but also enhance reproducibility and allow
finer-grained analysis across larger test sets. Looking ahead, the next research phase will apply
instruction tuning and task-specific fine-tuning, aligning pretrained models with specialized ontology
tasks such as those outlined in the LLMs4OL [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] benchmark, as well as expanding applications across
diverse domains like education and biomedicine. Together, these steps aim to move small, open-source
LLMs beyond general improvements toward expert-level performance in key ontology engineering
applications.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT for grammar and spelling
checking, paraphrasing, translation, and rewording. After using this tool, the authors reviewed and edited
the content as needed and take full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Fernández-López</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gómez-Pérez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Juristo</surname>
          </string-name>
          ,
          <article-title>Methontology: from ontological art towards ontological engineering</article-title>
          (
          <year>1997</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Poveda-Villalón</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fernández-Izquierdo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fernández-López</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>García-Castro</surname>
          </string-name>
          ,
          <article-title>Lot: An industrial oriented ontology engineering framework</article-title>
          ,
          <source>Engineering Applications of Artificial Intelligence</source>
          <volume>111</volume>
          (
          <year>2022</year>
          )
          <fpage>104755</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lambrix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Armiento</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Hartig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Abd Nikooie Pour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>The materials design ontology</article-title>
          ,
          <source>Semantic Web</source>
          (
          <year>2024</year>
          )
          <fpage>1</fpage>
          -
          <lpage>35</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Development of an ontology for construction carbon emission tracking and evaluation</article-title>
          ,
          <source>Journal of Cleaner Production</source>
          <volume>443</volume>
          (
          <year>2024</year>
          )
          <fpage>141170</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Amalki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Tatane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bouzit</surname>
          </string-name>
          ,
          <article-title>Deep learning-driven ontology learning: A systematic mapping study</article-title>
          ,
          <source>Engineering, Technology &amp; Applied Science Research</source>
          <volume>15</volume>
          (
          <year>2025</year>
          )
          <fpage>20085</fpage>
          -
          <lpage>20094</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>An</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>A short review for ontology learning: Stride to large language models trend</article-title>
          ,
          <source>arXiv preprint arXiv:2404.14991</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Babaei Giglou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>D'Souza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <article-title>Llms4ol: Large language models for ontology learning</article-title>
          , in: International Semantic Web Conference, Springer,
          <year>2023</year>
          , pp.
          <fpage>408</fpage>
          -
          <lpage>427</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Llama</surname>
          </string-name>
          ,
          <source>Model cards and prompt formats - Llama 3.2</source>
          , https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_2/,
          <year>2024</year>
          . Accessed: 2025-03-04.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Team</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kamath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ferret</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pathak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Vieillard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Merhej</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Perrin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Matejovicova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rivière</surname>
          </string-name>
          , et al.,
          <article-title>Gemma 3 technical report</article-title>
          ,
          <source>arXiv preprint arXiv:2503.19786</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Biderman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schoelkopf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. G.</given-names>
            <surname>Anthony</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bradley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>O'Brien</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hallahan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Purohit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U. S.</given-names>
            <surname>Prashanth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Raff</surname>
          </string-name>
          , et al.,
          <article-title>Pythia: A suite for analyzing large language models across training and scaling</article-title>
          ,
          <source>in: International Conference on Machine Learning</source>
          , PMLR,
          <year>2023</year>
          , pp.
          <fpage>2397</fpage>
          -
          <lpage>2430</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Saeedizade</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Blomqvist</surname>
          </string-name>
          ,
          <article-title>Navigating ontology development with large language models</article-title>
          ,
          <source>in: European Semantic Web Conference</source>
          , Springer,
          <year>2024</year>
          , pp.
          <fpage>143</fpage>
          -
          <lpage>161</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Vetter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Aryan</surname>
          </string-name>
          ,
          <article-title>Using large language models for OntoClean-based ontology refinement</article-title>
          ,
          <source>arXiv preprint arXiv:2403.15864</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Toro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. V.</given-names>
            <surname>Anagnostopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Bello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Blumberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cameron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Carmody</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. D.</given-names>
            <surname>Diehl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Dooley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. D.</given-names>
            <surname>Duncan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fey</surname>
          </string-name>
          , et al.,
          <article-title>Dynamic retrieval augmented generation of ontologies using artificial intelligence (DRAGON-AI)</article-title>
          ,
          <source>Journal of Biomedical Semantics</source>
          <volume>15</volume>
          (
          <year>2024</year>
          )
          <fpage>19</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>N.</given-names>
            <surname>Fathallah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>De Giorgis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Poltronieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Haase</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kovriguina</surname>
          </string-name>
          ,
          <article-title>NeOn-GPT: a large language model-powered pipeline for ontology learning</article-title>
          ,
          <source>in: European Semantic Web Conference</source>
          , Springer,
          <year>2024</year>
          , pp.
          <fpage>36</fpage>
          -
          <lpage>50</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. A.</given-names>
            <surname>Carriero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Schreiberhuber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tsaneva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. S.</given-names>
            <surname>González</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>de Berardinis</surname>
          </string-name>
          ,
          <article-title>OntoChat: a framework for conversational ontology engineering using language models</article-title>
          ,
          <source>in: European Semantic Web Conference</source>
          , Springer,
          <year>2024</year>
          , pp.
          <fpage>102</fpage>
          -
          <lpage>121</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Horrocks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Allocca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Sapkota</surname>
          </string-name>
          ,
          <article-title>DeepOnto: A Python package for ontology engineering with deep learning</article-title>
          ,
          <source>Semantic Web</source>
          <volume>15</volume>
          (
          <year>2024</year>
          )
          <fpage>1991</fpage>
          -
          <lpage>2004</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mukanova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Milosz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dauletkaliyeva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nazyrova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Yelibayeva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kuzin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kussepova</surname>
          </string-name>
          ,
          <article-title>LLM-powered natural language text processing for ontology enrichment</article-title>
          ,
          <source>Applied Sciences</source>
          <volume>14</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>L. M. V.</given-names>
            <surname>da Silva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kocher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gehlhof</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fay</surname>
          </string-name>
          ,
          <article-title>On the use of large language models to generate capability ontologies</article-title>
          ,
          <source>in: 2024 IEEE 29th International Conference on Emerging Technologies and Factory Automation (ETFA)</source>
          , IEEE,
          <year>2024</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>J.</given-names>
            <surname>Palomar-Giner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Saiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Espuña</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Da Dalt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Llop</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ostendorf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. O.</given-names>
            <surname>Suarez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Rehm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gonzalez-Agirre</surname>
          </string-name>
          , et al.,
          <article-title>A curated catalog: Rethinking the extraction of pretraining corpora for mid-resourced languages</article-title>
          ,
          <source>in: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>335</fpage>
          -
          <lpage>349</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>P. J. O.</given-names>
            <surname>Suárez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Sagot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Romary</surname>
          </string-name>
          ,
          <article-title>Asynchronous pipeline for processing huge corpora on medium to low resource infrastructures</article-title>
          ,
          <source>in: 7th Workshop on the Challenges in the Management of Large Corpora (CMLC-7)</source>
          ,
          <source>Leibniz-Institut für Deutsche Sprache</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>OPAL: Ontology-aware pretrained language model for end-to-end task-oriented dialogue</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>11</volume>
          (
          <year>2023</year>
          )
          <fpage>68</fpage>
          -
          <lpage>84</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Conneau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Khandelwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wenzek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Guzmán</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>Unsupervised cross-lingual representation learning at scale</article-title>
          ,
          <source>arXiv preprint arXiv:1911.02116</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kudugunta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Caswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Garcia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Xin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kusupati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Stella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bapna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Firat</surname>
          </string-name>
          ,
          <article-title>MADLAD-400: A multilingual and document-level large audited dataset</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>36</volume>
          (
          <year>2023</year>
          )
          <fpage>67284</fpage>
          -
          <lpage>67296</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>J.</given-names>
            <surname>Frey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Streitmatter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Götz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hellmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Arndt</surname>
          </string-name>
          ,
          <article-title>DBpedia Archivo: a web-scale interface for ontology archiving under consumer-oriented aspects</article-title>
          ,
          <source>Semantic Systems. In the Era of Knowledge Graphs</source>
          <volume>12378</volume>
          (
          <year>2020</year>
          )
          <fpage>19</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>H.</given-names>
            <surname>Alani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Brewster</surname>
          </string-name>
          ,
          <article-title>Metrics for ranking ontologies</article-title>
          ,
          <source>in: Proceedings of the Evaluating Ontologies for the Web Workshop (EON2006)</source>
          , 15th International World Wide Web Conference, EON Workshop, Edinburgh, Scotland,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>A. J. L.</given-names>
            <surname>Tello</surname>
          </string-name>
          ,
          <article-title>Métrica de idoneidad de ontologías</article-title>
          ,
          <source>Ph.D. thesis</source>
          , Universidad de Extremadura,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gutierrez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tomas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Moreno</surname>
          </string-name>
          ,
          <article-title>Developing an ontology schema for enriching and linking digital media assets</article-title>
          ,
          <source>Future Generation Computer Systems</source>
          <volume>101</volume>
          (
          <year>2019</year>
          )
          <fpage>381</fpage>
          -
          <lpage>397</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <article-title>A practical framework for evaluating the quality of knowledge graph</article-title>
          ,
          <source>in: Knowledge Graph and Semantic Computing: Knowledge Computing and Language Understanding: 4th China Conference, CCKS 2019, Hangzhou, China, August 24-27, 2019, Revised Selected Papers 4</source>
          , Springer,
          <year>2019</year>
          , pp.
          <fpage>111</fpage>
          -
          <lpage>122</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>J.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Learning to break the loop: Analyzing and mitigating repetitions for neural text generation</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>3082</fpage>
          -
          <lpage>3095</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>J.</given-names>
            <surname>Duan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Qu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Weng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , et al.,
          <article-title>Efficient training of large language models on distributed infrastructures: a survey</article-title>
          ,
          <source>arXiv preprint</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>