
1613-0073

Towards Pattern-based Complex Ontology Matching using SPARQL and LLM

Ondřej Zamazal

ondrej.zamazal@vse.cz 0

Ontology, Ontology Matching, Complex Ontology Matching, Large Language Model, Knowledge Graph

0 Prague University of Economics and Business, Czechia

2024

17 19

Complex ontology matching is a process to match complex structures in ontologies. While many matching tools tackle simple ontology matching, complex ontology matching is still rare. However, one entity in one ontology can be similar to a complex structure (1-to-n) or even complex structures can be on both sides (m-to-n). Therefore, the application, e.g., data integration, must consider complex correspondences within ontology alignment. Our poster paper presents a pattern-based approach where particular SPARQL queries correspond to a specific pattern, e.g., Class by Attribute Type (CAT), for its detection. SPARQL queries are anchored to entities from simple correspondences on input. Detected complex correspondence candidates are verbalized to be validated by the Large Language Model (LLM). Further, we provide a zero-shot prompting preliminary experiment and evaluation. The poster paper is equipped with the Jupyter notebook for automation of the pipeline and the full report of the experiment at: https://github.com/OndrejZamazal/ComplexOntologyMatching-SEMANTiCS2024

CEUR ceur-ws.org

1. Introduction

Domain knowledge is often shared via a domain ontology [3]. Since different agents see the domain differently, multiple ontologies for one domain are inevitable. To enable interoperability, Ontology Matching [2] aims at discovering relationships (e.g., equivalence) between entities of two ontologies (O1, O2); these relationships are called correspondences, and a set of them forms an alignment. Simple correspondences involve a single entity from each ontology, e.g., O1:Document=O2:Manuscript. Such a correspondence can enable data interoperability between systems based on O1 and O2, respectively. There are ample matching tools, such as LogMap [6], to discover such correspondences. However, correspondences targeting single entities can only solve some interoperability issues. For example, O1 has an entity Reviewer that is not explicitly present in O2, yet O2 contains the terminology to describe the Reviewer concept. Therefore, discovering complex correspondences would further support data and schema interoperability: e.g., partly in Manchester OWL syntax, O1:Reviewer is equal to the complex concept O2:Person and (O2:authorOf some O2:Review). Complex ontology matching aims at matching complex concepts/structures (e.g., using a logic constructor) on at least one side of an ontology matching pair [11]. While simple ontology matching has attracted ample matching tools and evaluation efforts (e.g., OAEI 2023¹ had ten tracks targeting simple matching), only one OAEI 2023 track focused on complex matching, and it had no tool participation in 2023. There have been two complex alignment benchmarks in OAEI so far: the GeoLink complex alignment benchmark [14] and the consensual dataset for complex ontology matching [12] based on conference ontologies from the OntoFarm collection [13].

https://nb.vse.cz/~svabo/ (O. Zamazal)

© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2. Related Work

Several approaches have dealt theoretically with complex ontology matching, mostly pattern-based approaches [11]. For example, [8] detects complex correspondences using several structural and lexical matching conditions, such as hyponyms and head nouns of labels and their relationship. Similarly, in [10], the detection of the naming aspect played a crucial role. Our approach relies on the pattern-based, structural aspect, but we do not use the lexical aspect for detection. While structural matching conditions are usually straightforward, designing and applying lexical matching conditions is often challenging. We therefore avoid lexical aspects during the detection phase and instead use verbalization and a Large Language Model (LLM) for the validation step.

LLMs have already been applied to simple ontology matching. Conference ontologies have been matched using ChatGPT [7]. The OLaLa matching system [5] addressed several OAEI tracks, while the approach in [4] targeted the Bio-ML track in OAEI. While there is still room for improvement, the initial results are promising. Recently, an approach to complex ontology alignment using LLMs has been proposed [1]. It works on the GeoLink complex alignment benchmark, which includes two ontologies: the GeoLink Base Ontology (GBO) and the GeoLink Modular Ontology (GMO). The approach is based on chain-of-thought prompting. The input consists of the GMO ontology and a specific complex structure from the GBO ontology, with a prompt asking for the parts of GMO related to the given complex structure from GBO. The following prompt contains parts of related content from the GMO as text and code (module information), which improves the LLM's answer. The paper primarily addresses an experiment with a very specific setting. We share the general idea that an LLM could reduce the number of candidates that humans finally validate. However, the approach in [1] heavily involves human intervention in each step. On the contrary, we do not include humans in prompting, and we include SPARQL for detection. Further, we do not restrict complex matching to manually selected complex structure instances of one ontology on input. Our approach is instead driven by an alignment pattern.

3. Approach

We describe our pattern-based pipeline along with an example (in italics) targeting the alignment pattern Class by Attribute Type (CAT) [9]. This pattern specifies equivalence between a class in O1 and a class in O2 whose scope is restricted by an existential restriction; in Manchester OWL syntax: O1:Class1 EquivalentTo O2:Class1 and (O2:property some O2:Class2).
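As a minimal sketch of the CAT pattern above, the following Python helper fills the Manchester OWL syntax template with concrete entity names. The function name `cat_to_manchester` and the `O1:`/`O2:` prefixes as plain string prefixes are illustrative assumptions, not part of the accompanying notebook.

```python
def cat_to_manchester(o1_class: str, o2_class: str,
                      o2_property: str, o2_filler: str) -> str:
    """Serialize one CAT correspondence as a Manchester OWL syntax string.

    Hypothetical helper: it only fills the template
    O1:Class1 EquivalentTo O2:Class1 and (O2:property some O2:Class2).
    """
    return (
        f"O1:{o1_class} EquivalentTo "
        f"O2:{o2_class} and (O2:{o2_property} some O2:{o2_filler})"
    )


# The running example from the Introduction.
axiom = cat_to_manchester("Reviewer", "Person", "authorOf", "Review")
print(axiom)
```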

Often, ontology matching techniques limit the search space. We use a simple alignment and limit complex matching to entities in complex structures related to entities from the simple alignment. We assume that potentially matching complex structures involve entities from simple correspondences.

¹ https://oaei.ontologymatching.org/2023/

Input ontologies are conference ontologies: O1=cmt, O2=ekaw. For input alignment, we used the subset of the alignment from LogMap OAEI 2023 (O1:Paper=O2:Paper; O1:Person=O2:Person) to keep the experiment shorter. Any highly certain correspondences could be used.

Step 1. Detection is based on a structural aspect. We use a pair of SPARQL queries designed according to the given alignment pattern for O1 and O2, respectively.² One alignment pattern can lead to different structural aspects for detection (i.e., different SPARQL queries) in O1 and O2. Some entities in the SPARQL queries are anchored based on the input alignment.
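The anchoring idea in Step 1 can be sketched as a pair of query builders. The queries below are only one plausible reading of the CAT example (a subclass of the anchored entity in O1; a property whose domain is the anchored entity and whose range has a subclass in O2); they are not the queries from the supplementary material. The function names and the `http://example.org/...` IRIs are placeholders.

```python
RDFS_PREFIX = "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"


def build_o1_query(anchor_iri: str) -> str:
    """Assumed O1 side of CAT: named subclasses of the anchored entity (?ent1)."""
    return (
        RDFS_PREFIX
        + "SELECT DISTINCT ?class1 WHERE {\n"
        + f"  ?class1 rdfs:subClassOf <{anchor_iri}> .\n"
        + "}"
    )


def build_o2_query(anchor_iri: str) -> str:
    """Assumed O2 side of CAT: a property from the anchored entity to ?class1,
    plus a subclass ?class2 of ?class1 to use as the restriction filler."""
    return (
        RDFS_PREFIX
        + "SELECT DISTINCT ?property1 ?class1 ?class2 WHERE {\n"
        + f"  ?property1 rdfs:domain <{anchor_iri}> ;\n"
        + "             rdfs:range ?class1 .\n"
        + "  ?class2 rdfs:subClassOf ?class1 .\n"
        + "}"
    )


# Placeholder IRIs standing in for the two sides of one input correspondence.
o1_query = build_o1_query("http://example.org/o1#Person")
o2_query = build_o2_query("http://example.org/o2#Person")
print(o1_query)
print(o2_query)
```

The resulting strings can be passed to any SPARQL 1.1 engine; running one builder per side and per input correspondence mirrors the "pair of queries" described above.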

Figure 1 visualizes the SPARQL queries for O1 and O2 for CAT; the correspondence from the input alignment is depicted as a double-headed arrow. The detection is run for each correspondence from the input alignment. In our case, the O1 query returned 4 and 7 results and the O2 query returned 2 and 14 results for the Paper and Person input correspondences, respectively. Example for O1 given Person (?ent1) from the input correspondence: O1:Class1=Reviewer. Example for O2 given Person (?ent1) from the input correspondence: O2:property1=authorOf, O2:class1=Document, O2:class2=Review.

Step 2. Results from detecting both ontologies are joined according to the alignment pattern separately per each input correspondence.

In our example, there are 4 × 2 = 8 complex correspondence candidates for the Paper entity from the input correspondence and 7 × 14 = 98 complex correspondence candidates for the Person entity from the input correspondence. E.g., regarding the input correspondence O1:Person=O2:Person, O1:Reviewer is joined with O2:Person, O2:authorOf, and O2:Review.
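The join in Step 2 amounts to a Cartesian product of the O1 and O2 results per input correspondence. The sketch below reproduces the candidate counts reported above; the result lists are dummy placeholders of the reported sizes, and `join_candidates` is a hypothetical helper name.

```python
from itertools import product


def join_candidates(o1_results, o2_results):
    """Pair every O1 result with every O2 result for one input correspondence."""
    return list(product(o1_results, o2_results))


# Reported result sizes per input correspondence: (O1 results, O2 results).
result_sizes = {"Paper": (4, 2), "Person": (7, 14)}

candidates = {}
for anchor, (n_o1, n_o2) in result_sizes.items():
    # Dummy placeholders standing in for the real SPARQL result rows.
    o1_rows = [f"o1_{anchor}_{i}" for i in range(n_o1)]
    o2_rows = [f"o2_{anchor}_{i}" for i in range(n_o2)]
    candidates[anchor] = join_candidates(o1_rows, o2_rows)

total = sum(len(rows) for rows in candidates.values())
print({anchor: len(rows) for anchor, rows in candidates.items()}, total)

# The concrete example from the text: one O1 row joined with one O2 row.
example = join_candidates(["Reviewer"], [("authorOf", "Document", "Review")])
```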

Step 3. Pattern-based template-driven verbalization to natural language (English) is applied to complex correspondence candidates to enable their validation using LLM. We also apply several natural language preprocessing steps, such as tokenization and lowercasing. Similarly, we use a template for serialization into Manchester OWL syntax.

For CAT, we have the following verbalization pattern into English: "<O1:class1> is the same as <O2:ent1> which is³ <O2:property1> of <O2:class2>". Example: Reviewer is the same as person which is author of review.

² The approach is pattern-based in two ways: because it is driven by the alignment pattern and because the SPARQL query can also be considered as a pattern for detection.

³ If no "has" exists in property1, "is" is added.
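A minimal sketch of the template-driven verbalization for CAT follows. The camelCase tokenization, the handling of a trailing "of" in the property name (dropped so that the template's own "of" is not doubled), and the function names are assumptions made for illustration; the notebook may preprocess names differently.

```python
import re


def tokenize(name: str) -> str:
    """Split camelCase and underscores, then lowercase (assumed preprocessing)."""
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    spaced = spaced.replace("_", " ")
    return " ".join(spaced.lower().split())


def verbalize_cat(o1_class1: str, o2_ent1: str,
                  o2_property1: str, o2_class2: str) -> str:
    """Fill the CAT template:
    <O1:class1> is the same as <O2:ent1> which is <O2:property1> of <O2:class2>
    """
    words = tokenize(o2_property1).split()
    # "is" is added only when the property name contains no "has".
    verb = "" if "has" in words else "is "
    # Assumption: drop a trailing "of" so the template's "of" is not repeated.
    if words and words[-1] == "of":
        words = words[:-1]
    subject = tokenize(o1_class1).capitalize()
    return (
        f"{subject} is the same as {tokenize(o2_ent1)} "
        f"which {verb}{' '.join(words)} of {tokenize(o2_class2)}"
    )


sentence = verbalize_cat("Reviewer", "Person", "authorOf", "Review")
print(sentence)
```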

Step 4. Finally, LLM is used to validate whether verbalized complex correspondence candidates are (probably) positives/negatives.

We experimented with different LLMs. So far, the best results have been achieved using GPT-4o.⁴
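Since Step 4 is not yet automated, the following is only a sketch of how a zero-shot validation prompt could be assembled and how a reply could be mapped to the four labels used in the evaluation. The prompt wording, the function names, and the label strings are hypothetical; the actual prompt is the one in the supplementary material, and no LLM API is called here.

```python
LABELS = ("negative", "probably negative", "probably positive", "positive")


def build_prompt(statements):
    """Assemble a zero-shot prompt (hypothetical wording, not the paper's prompt)."""
    header = (
        "For each statement, answer with exactly one label: "
        + ", ".join(LABELS) + ".\n"
    )
    body = "\n".join(f"{i}. {s}" for i, s in enumerate(statements, start=1))
    return header + body


def parse_label(reply: str):
    """Map a free-text reply to one of LABELS, or None if nothing matches."""
    normalized = reply.strip().lower()
    # Check longer labels first so "probably positive" is not read as "positive".
    for label in sorted(LABELS, key=len, reverse=True):
        if label in normalized:
            return label
    return None


prompt = build_prompt(["Reviewer is the same as person which is author of review."])
print(prompt)
print(parse_label("Probably positive."))
```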

4. Preliminary Experiment and Evaluation

The experiment is related to step 4: LLM is used to validate whether verbalized complex correspondence candidates are (probably) positives/negatives. Our online supplementary material contains the full report (SPARQL queries, their results and their joins, verbalized complex correspondence candidates, the prompt for GPT-4o,⁵ its results, evaluation, and complex correspondences in EDOAL and Manchester OWL syntax). Of 106 candidates, 14 were labeled as negatives, 74 as probably negatives, ten as probably positives, and eight as positives. The supplementary material contains a detailed evaluation of all eight candidates labeled by LLM as positives. Human evaluation of these eight candidates yielded one true positive, one partly true positive, and six partly false positives. For the sake of brevity, here we present three evaluation examples:

1. Reviewer is the same as person which is author of review.

Although it is debatable whether being an author of a review is enough to be a real reviewer, it is certainly close enough. It was evaluated as a true positive.

2. Meta-reviewer is the same as person which is author of review.

A meta-reviewer is not only the author of a review; (s)he also has a specific role within the reviewing process. Meta-reviewer is rather a subclass of person which is author of review, i.e., Class ⊑ Class Expression. It was evaluated as a partly true positive.

3. Author is the same as person which is author of abstract.

Since some conferences call for abstracts, being an author can be merely based on abstract authorship. However, the subsumption relation would be more fitting (i.e., Author subsumes person which is author of abstract). Since this leads to a General Concept Inclusion (GCI) subsumption, i.e., Class ⊒ Class Expression, which is not always allowed, it was evaluated as a partly false positive.

Considering only equivalence, precision equals 1/8 = 0.125. However, subsumption is also important for interoperability, meaning relaxed precision could be used. If GCI axioms (partly false positives) are not allowed, relaxed precision equals 2/8 = 0.25. If GCI axioms are allowed,⁶ relaxed precision equals 8/8 = 1.0. Regarding recall, the preliminary evaluation (details in the online supplementary material) shows that all negatives are true. However, it needs further evaluation in terms of relaxed recall.
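The precision figures above follow directly from the human evaluation of the eight candidates labeled as positives. The short sketch below recomputes them; the dictionary keys and variable names are illustrative.

```python
# Human evaluation of the eight candidates labeled "positive" by the LLM.
evaluation = {
    "true_positive": 1,          # equivalence holds
    "partly_true_positive": 1,   # Class ⊑ Class Expression
    "partly_false_positive": 6,  # Class ⊒ Class Expression (GCI)
}
labeled_positive = sum(evaluation.values())

# Strict precision: only equivalence counts as correct.
precision = evaluation["true_positive"] / labeled_positive

# Relaxed precision without GCI axioms: subsumption Class ⊑ Class Expression also counts.
relaxed_without_gci = (
    evaluation["true_positive"] + evaluation["partly_true_positive"]
) / labeled_positive

# Relaxed precision with GCI axioms: all three outcome types count.
relaxed_with_gci = sum(evaluation.values()) / labeled_positive

print(precision, relaxed_without_gci, relaxed_with_gci)
```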

5. Conclusion and Future Work

We reported on our approach and preliminary experiments with pattern-based complex ontology matching using SPARQL for detection and verbalization before validation with LLM.

Similarly to [8], we focus on the structural aspect of detection. In contrast, we do not capture lexical aspects for detection; instead, we employ verbalization before validation by LLM. Similarly to [1], we employ LLM for complex ontology matching, but we only use it in the final step of the approach for validation. Contrary to [1], we do not rely on humans in prompting, on the complete ontology as input, on pre-selected complex structure instances to be matched, or on manually selected additional information for complex matching.

⁴ via https://chatgpt.com/

⁵ We ran GPT-4o several times with minor changes in its reply; these are not substantial for the evaluation.

⁶ GCI subsumptions could also help with interoperability within some scenarios.

For now, we consider a 1-to-n relationship. However, it can be generalized to an m-to-n relationship. In our experiments, we use GPT-4o. However, we will experiment more with other LLMs (e.g., Llama 3, Mixtral). Further, while we employ direct question prompting, we will also experiment with contextual question prompting, where context will be gathered automatically. We aim to explore more alignment patterns, involve more ontologies, and conduct in-depth evaluations in future work. While the provided Jupyter notebook covers three steps, the fourth step, dealing with LLM, will be implemented after further experimentation with other LLMs.

Acknowledgments

This work has been supported by the EU’s Horizon Europe grant no. 101058682 (Onto-DESIDE).

References

[1] Amini, R., Norouzi, S. S., Hitzler, P., Amini, R. Towards Complex Ontology Alignment using Large Language Models. arXiv preprint arXiv:2404.10329. 2024.

[2] Euzenat, J., Shvaiko, P. Ontology matching. Springer. ISBN 978-3-642-38720-3. 2013.

[3] Gruber, T. R. Toward principles for the design of ontologies used for knowledge sharing? International Journal of Human-Computer Studies, 43(5-6). 1995.

[4] He, Y., Chen, J., Dong, H., Horrocks, I. Exploring large language models for ontology alignment. In: ISWC 2023 Posters and Demos. 2023.

[5] Hertling, S., Paulheim, H. OLaLa: Ontology matching with large language models. In: Proc. of the 12th Knowledge Capture Conference 2023. 2023.

[6] Jiménez-Ruiz, E., Cuenca Grau, B. LogMap: Logic-based and scalable ontology matching. In: International Semantic Web Conference. Springer. 2011.

[7] Norouzi, S. S., Mahdavinejad, M. S., Hitzler, P. Conversational Ontology Alignment with ChatGPT. In: Proc. of the Ontology Matching workshop at ISWC. 2023.

[8] Ritze, D., Völker, J., Meilicke, C., Šváb-Zamazal, O. Linguistic analysis for complex ontology matching. In: Proc. of the Ontology Matching workshop at ISWC. 2010.

[9] Scharffe, F. Correspondence patterns representation. PhD thesis, Univ. of Innsbruck. 2009.

[10] Šváb-Zamazal, O., Svátek, V. Towards ontology matching via pattern-based detection of semantic structures in OWL ontologies. In: Proc. of the Znalosti conference. 2009.

[11] Thiéblin, E., Haemmerlé, O., Hernandez, N., Trojahn, C. Survey on complex ontology matching. Semantic Web, 11(4). 2020.

[12] Thiéblin, E., Cheatham, M., Trojahn, C., Zamazal, O. A consensual dataset for complex ontology matching evaluation. The Knowledge Engineering Review, 35:e34. 2020.

[13] Zamazal, O., Svátek, V. The ten-year OntoFarm and its fertilization within the onto-sphere. Journal of Web Semantics, 43. 2017.

[14] Zhou, L., Cheatham, M., Krisnadhi, A., Hitzler, P. GeoLink data set: A complex alignment benchmark from real-world ontology. Data Intelligence, 2(3). 2020.