<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <article-id pub-id-type="doi">10.24963/IJCAI.2021</article-id>
      <title-group>
        <article-title>PAC Learning of Concept Inclusions for Ontology-Mediated Query Answering (Extended Abstract)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sergei Obiedkov</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Barış Sertkaya</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Computer Science and Engineering, Frankfurt University of Applied Sciences</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Faculty of Computer Science / cfaed / ScaDS.AI, TU Dresden</institution>
          ,
          <addr-line>Dresden</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <volume>12507</volume>
      <fpage>19</fpage>
      <lpage>27</lpage>
      <abstract>
        <p>This extended abstract summarizes results from [1], where we propose a practical method for learning axioms in a Description Logic (DL) ontology using techniques from probably approximately correct (PAC) learning. The goal is to support ontology-mediated query answering (OMQA) [2, 3] by approximating an unknown TBox 𝒯 through interaction with a domain expert oracle that can decide whether a concept inclusion (CI) C ⊑ D is entailed by 𝒯. Such an oracle may be instantiated in different ways: for example, as a human domain expert; a large language model (LLM); a dataset representative of the domain; or a large, complex ontology from which a smaller, focused one is to be distilled. Our method learns subsumption relationships among a finite set ℳ of concept descriptions, called the base set. This base set constrains the search space of candidate axioms and can be chosen to suit the application, e.g., all concept names, combinations of concept names with existential restrictions up to a fixed role depth, or a tailored selection relevant to the user. We do not fix a particular DL or a set of constructors; our results apply to arbitrary DLs that support conjunction. The algorithm also employs a sampling oracle that generates CIs over ℳ according to a fixed but arbitrary distribution 𝒟. Given ε, δ ∈ (0, 1), it runs in time polynomial in the relevant parameters and returns a TBox 𝒯′ such that, with probability at least 1 − δ (over the algorithm's random choices), the probability (under 𝒟) that a CI over ℳ is entailed by exactly one of 𝒯 and 𝒯′ is at most ε. We also show how to direct the learning process toward subsumptions that are particularly relevant to a given ABox 𝒜, by adapting the distribution 𝒟. This enables the learned axioms to improve recall in query answering over incomplete datasets. Experimental evaluation on benchmark ontologies confirms the effectiveness of our approach. For related work on ontology learning in DLs, see [4, 5, 6, 7, 8, 9, 10, 11, 12].</p>
      </abstract>
      <kwd-group>
        <kwd>Ontologies</kwd>
        <kwd>Description Logics</kwd>
        <kwd>Active Learning</kwd>
        <kwd>Knowledge Acquisition</kwd>
        <kwd>PAC Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. PAC Learning of Concept Inclusions</title>
      <p>
        Our algorithm computes, with probability at least 1 − δ, a TBox 𝒯′ such that
        Pr( (𝒯 ⊨ C ⊑ D) ⟺ (𝒯′ ⊭ C ⊑ D) ) ≤ ε,
where the probability is taken over CIs C ⊑ D drawn from 𝒟, using subsumption queries. We modify the equivalence oracle so that, instead of returning a model of
exactly one of the two non-equivalent Horn formulas, it returns the GCI corresponding to a Horn
clause that is entailed by exactly one of the two formulas. A probably approximately correct algorithm
is obtained by replacing each call to this equivalence oracle with an appropriate number of calls to a
suitable sampling oracle. Please see [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] for more details.
      </p>
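      <p>The conversion of an equivalence oracle into sampling-oracle calls can be sketched in code. The following is an illustrative Python fragment, not the authors' implementation: it works with atomic CIs over concept names, uses a toy reflexive-transitive entailment check, and applies the standard sample-size bound mᵢ = ⌈(ln(1/δ) + i·ln 2)/ε⌉ for the i-th equivalence query; all names are assumptions made for this sketch.</p>

```python
import math
import random

def entails(tbox, ci):
    """Toy entailment for atomic CIs: is (lhs, rhs) in the
    reflexive-transitive closure of the TBox's subsumptions?"""
    lhs, rhs = ci
    seen, frontier = {lhs}, [lhs]
    while frontier:
        c = frontier.pop()
        for a, b in tbox:
            if a == c and b not in seen:
                seen.add(b)
                frontier.append(b)
    return rhs in seen

def approx_equivalent(target, hypothesis, sample_ci, eps, delta, i):
    """Replace the i-th equivalence query by m_i sampling-oracle calls.
    Returns a CI entailed by exactly one of the two TBoxes, or None if
    the hypothesis is probably approximately correct."""
    m = math.ceil((math.log(1 / delta) + i * math.log(2)) / eps)
    for _ in range(m):
        ci = sample_ci()  # drawn from the fixed but arbitrary distribution D
        if entails(target, ci) != entails(hypothesis, ci):
            return ci     # counterexample for the learner to process
    return None

# Illustrative run: the hypothesis is missing B <= C (and hence A <= C).
rng = random.Random(0)
names = ["A", "B", "C"]
target = {("A", "B"), ("B", "C")}
hypothesis = {("A", "B")}
sample = lambda: (rng.choice(names), rng.choice(names))
counterexample = approx_equivalent(target, hypothesis, sample, 0.1, 0.1, 1)
```

      <p>With these parameters, 30 sampled CIs are checked, so a counterexample such as B ⊑ C is found with high probability; a None return means the two TBoxes agree on all but an ε-fraction of the distribution, with confidence 1 − δ.</p>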
    </sec>
    <sec id="sec-3">
      <title>3. Varying the Query Distribution</title>
      <p>Our definition of approximation involves a distribution 𝒟 of subsumption queries. This distribution is
meant to reflect the interests of the user of the ontology we are trying to learn. In a basic scenario, we
may assume that users explicitly pose subsumption queries to the ontology and that 𝒟 is the distribution
of these queries.</p>
      <p>
        A more practically relevant scenario is given by ontology-mediated query answering [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ]: Given a
query (a concept description) C and a knowledge base 𝒦 = (𝒯, 𝒜), find all instances of C in 𝒦. The
TBox 𝒯 may be only partially known or not known at all, in which case we may use our PAC algorithm
to learn its approximation through interaction with an expert or from a representative dataset. In
this setting, an approximation may be considered good if it ensures high precision and recall of query
answering.
      </p>
      <p>
        Definition 1. Let 𝒦 = (𝒯, 𝒜) be a knowledge base, 𝒯′ be a TBox, and C be a query. Using certain-answer
semantics [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], we define cert(C, 𝒦) as the set of individual names a from 𝒜 satisfying 𝒦 ⊨ C(a). The
precision and recall of 𝒯′ for C on 𝒦 are, respectively,
      </p>
      <p>P(𝒯′, C) = | cert(C, (𝒯′, 𝒜)) ∩ cert(C, 𝒦)| / | cert(C, (𝒯′, 𝒜))| and R(𝒯′, C) = | cert(C, (𝒯′, 𝒜)) ∩ cert(C, 𝒦)| / | cert(C, 𝒦)|.</p>
      <p>If the denominator is 0, then the value of the corresponding measure is defined to be 1.</p>
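      <p>Definition 1 can be made concrete with a small sketch. The toy encoding below is an assumption for illustration, not the paper's implementation: the ABox is a map from individuals to asserted concept names, the TBox is a set of atomic CIs, and certain answers are computed by closing assertions under the CIs.</p>

```python
def entails(tbox, sub, sup):
    """Does the TBox of atomic CIs entail sub <= sup (reflexive-transitive)?"""
    seen, frontier = {sub}, [sub]
    while frontier:
        c = frontier.pop()
        for a, b in tbox:
            if a == c and b not in seen:
                seen.add(b)
                frontier.append(b)
    return sup in seen

def cert(query, tbox, abox):
    """Certain answers: individuals with an asserted concept subsumed by query."""
    return {ind for ind, concepts in abox.items()
            if any(entails(tbox, c, query) for c in concepts)}

def precision_recall(target_tbox, learned_tbox, abox, query):
    """Definition 1, with the 0-denominator convention giving 1."""
    true_ans = cert(query, target_tbox, abox)   # cert(C, K)
    hyp_ans = cert(query, learned_tbox, abox)   # cert(C, (T', A))
    inter = len(true_ans & hyp_ans)
    precision = inter / len(hyp_ans) if hyp_ans else 1.0
    recall = inter / len(true_ans) if true_ans else 1.0
    return precision, recall

abox = {"x": {"A"}, "y": {"B"}}
target = {("A", "B")}   # hypothetical target TBox T
learned = set()         # empty T': query answering over the ABox alone
p, r = precision_recall(target, learned, abox, "B")  # p = 1.0, r = 0.5
```

      <p>Here the empty hypothesis misses the inferred answer x for the query B, so precision is 1 while recall drops to 0.5, the situation the learned axioms are meant to repair.</p>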
      <p>There are two standard ways to aggregate precision and recall for several queries: macroaveraging
and microaveraging [14].</p>
      <p>Definition 2. Let 𝒦 = (𝒯, 𝒜) be a knowledge base, 𝒯′ be a TBox, and Q be a finite set of queries. The
macro precision and recall of 𝒯′ for Q on 𝒦 are the average values of the precision and recall over all
queries from Q:</p>
      <p>Pmacro(𝒯′, Q) = (∑_{C∈Q} P(𝒯′, C)) / |Q| and Rmacro(𝒯′, Q) = (∑_{C∈Q} R(𝒯′, C)) / |Q|.</p>
      <p>The micro precision Pmicro(𝒯′, Q) and micro recall Rmicro(𝒯′, Q) are defined, respectively, as
(∑_{C∈Q} | cert(C, (𝒯, 𝒜)) ∩ cert(C, (𝒯′, 𝒜))|) / (∑_{C∈Q} | cert(C, (𝒯′, 𝒜))|) and
(∑_{C∈Q} | cert(C, (𝒯, 𝒜)) ∩ cert(C, (𝒯′, 𝒜))|) / (∑_{C∈Q} | cert(C, (𝒯, 𝒜))|).</p>
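      <p>The difference between the two aggregations can be seen in a short sketch, using the same toy encoding of ABox and atomic-CI TBox as above (an illustrative assumption, not the paper's code): macro averages per-query ratios, while micro sums numerators and denominators before dividing.</p>

```python
def entails(tbox, sub, sup):
    """Reflexive-transitive entailment over a set of atomic CIs."""
    seen, frontier = {sub}, [sub]
    while frontier:
        c = frontier.pop()
        for a, b in tbox:
            if a == c and b not in seen:
                seen.add(b)
                frontier.append(b)
    return sup in seen

def cert(query, tbox, abox):
    return {ind for ind, cs in abox.items()
            if any(entails(tbox, c, query) for c in cs)}

def macro_micro(target, learned, abox, queries):
    """Definition 2: per-query averages (macro) vs. aggregated sums (micro)."""
    ps, rs, num, den_p, den_r = [], [], 0, 0, 0
    for q in queries:
        t, h = cert(q, target, abox), cert(q, learned, abox)
        inter = len(t & h)
        ps.append(inter / len(h) if h else 1.0)
        rs.append(inter / len(t) if t else 1.0)
        num, den_p, den_r = num + inter, den_p + len(h), den_r + len(t)
    p_macro, r_macro = sum(ps) / len(ps), sum(rs) / len(rs)
    p_micro = num / den_p if den_p else 1.0
    r_micro = num / den_r if den_r else 1.0
    return p_macro, r_macro, p_micro, r_micro

abox = {"x": {"A"}, "y": {"B"}, "z": {"A"}}
res = macro_micro({("A", "B")}, set(), abox, ["A", "B"])
```

      <p>In this example the empty hypothesis gives macro recall 2/3 but micro recall 0.6: micro weighting penalizes the query B more because it has more certain answers under the target TBox.</p>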
      <p>The goal in our OMQA scenario is to learn an approximation 𝒯′ of 𝒯 with high values of the
macro/micro precision and recall for some set Q of queries. If 𝒯′ is a lower approximation of 𝒯, then
the precision for every query is 1, and so are the macro and micro precision. In this case, we aim to
maximize the recall. Next we describe a heuristic approach to choosing the distribution of subsumption
queries in the learning algorithm so as to increase the micro recall on a given ABox 𝒜.</p>
      <p>Consider a subsumption query ⨅U ⊑ B with U ⊆ ℳ. If we care about micro recall, it seems particularly important
to ask this query whenever ⨅U has a lot of instances in 𝒦₀ = (∅, 𝒜), since a positive answer to the
query would then allow us to correctly assert B(a) for many individuals a. Therefore, a reasonable
approach seems to be to generate the left-hand sides ⨅U of subsumption queries proportionally to
| cert(⨅U, 𝒦₀)|. Regarding the right-hand sides, if B(a) rarely occurs in 𝒜, this may be due to two
reasons: B is a rare concept, or B is a generalization of other concepts and B(a) can be inferred
from the target TBox 𝒯 together with what is explicitly asserted in 𝒜 about a. We cannot tell which
of the two it is; so we may want to assume the second case to be on the safe side. Then, we may
want to generate the right-hand sides B of subsumption queries with probabilities proportional to
| Ind(𝒜) ∖ cert(B, 𝒦₀)|, i.e., to the number of individuals that are not (yet) known to be instances of B.</p>
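      <p>The weighting just described can be sketched as follows, again under the toy encoding of an ABox as a map from individuals to asserted concept names (function names and the bound on conjunction size are assumptions for this illustration): left-hand sides are weighted by their instance count in 𝒦₀ and right-hand sides by the number of individuals not yet known to be instances.</p>

```python
import random
from itertools import combinations

def cert0(concepts, abox):
    """Instances of the conjunction of `concepts` in K0 = (empty TBox, A):
    individuals asserted to belong to every conjunct."""
    return {ind for ind, cs in abox.items() if set(concepts) <= cs}

def ci_weights(base, abox, max_lhs_size=2):
    """Weights for the A-induced distribution: |cert(conj U, K0)| for LHS
    conjunctions U, and |Ind(A)| - |cert(B, K0)| for RHS concepts B."""
    lhs_w = {U: len(cert0(U, abox))
             for k in range(1, max_lhs_size + 1)
             for U in combinations(sorted(base), k)}
    rhs_w = {B: len(abox) - len(cert0((B,), abox)) for B in base}
    return lhs_w, rhs_w

def sample_ci(lhs_w, rhs_w, rng):
    """Draw one subsumption query according to the two weight tables."""
    lhs = rng.choices(list(lhs_w), weights=list(lhs_w.values()))[0]
    rhs = rng.choices(list(rhs_w), weights=list(rhs_w.values()))[0]
    return lhs, rhs

abox = {"x": {"A"}, "y": {"A", "B"}, "z": {"C"}}
lhs_w, rhs_w = ci_weights({"A", "B", "C"}, abox)
```

      <p>Conjunctions with no instances receive weight 0 and are never sampled, which is exactly the limitation addressed next by updating the distribution on the fly.</p>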
      <p>A problem with this approach is that B ⊑ C cannot be learned if B has no instances in 𝒦₀. To
address this, we need to change the distribution on the fly, so as to take into account what has already
been learned. Thus, having learned A ⊑ B, we update 𝒦₀ by replacing 𝒯₀ = ∅ with 𝒯₁ = {A ⊑ B}
and recalculate the probabilities involved in sampling premises with respect to 𝒦₁ = (𝒯₁, 𝒜). Now,
| cert(B, 𝒦₁)| &gt; 0, which makes it possible to learn B ⊑ C.</p>
      <p>
        This was the method used in the experiments we presented in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. However, it prioritizes concepts
⨅U with a large number of instances in 𝒜 even when these instances are the same for many different
U. This may sometimes negatively affect precision or recall for certain concepts in ℳ. Instead, when
sampling left-hand sides of CIs, we should try to maximize the coverage of individuals in 𝒜. Therefore, in
the experiments presented here, we adopt the following two-stage approach: first sample an individual
a from 𝒜 uniformly at random and then sample a subset of {B ∈ ℳ | a ∈ cert(B, 𝒦)}, also uniformly
at random.
      </p>
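      <p>The two-stage procedure can be sketched as follows. This is an illustrative Python fragment under stated assumptions, not the authors' code: the map from individuals to their certain base concepts is precomputed, a uniform subset is drawn by including each concept with probability 1/2, and the fallback to a singleton when the drawn subset is empty is our own convention.</p>

```python
import random

def sample_lhs(instance_concepts, rng):
    """Two-stage sampling of a CI left-hand side: pick an individual
    uniformly at random, then a uniformly random nonempty subset of the
    base concepts it is (currently) known to be an instance of."""
    a = rng.choice(sorted(instance_concepts))              # stage 1: individual
    concepts = sorted(instance_concepts[a])
    subset = {c for c in concepts if rng.random() < 0.5}   # stage 2: subset
    return subset if subset else {rng.choice(concepts)}    # avoid an empty LHS

# instance_concepts maps each individual a to {B in M | a in cert(B, K)}
instance_concepts = {"x": {"A"}, "y": {"A", "B"}, "z": {"C"}}
rng = random.Random(42)
lhs = sample_lhs(instance_concepts, rng)
```

      <p>Every sampled left-hand side is then a conjunction satisfied by at least one individual, so each draw covers some individual of 𝒜 regardless of how many instances the conjunction has overall.</p>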
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Evaluation</title>
      <p>We implemented our approach in a prototype tool, paclo, and evaluated it in the OMQA context. The
expert was simulated using the target TBox 𝒯; i.e., the response to a subsumption query α is positive if
and only if 𝒯 ⊨ α. Subsumption queries were answered with the ELK reasoner [15]. We set ε = δ = 0.001
to ensure a high probability of obtaining the desired approximation and averaged results over five runs.
Each setting is defined by a signature, an approximation type (ε- or lower ε-approximation), and a query
distribution 𝒟 (uniform or 𝒜-induced, as described in the previous section).</p>
      <p>
        We tested on six KBs: four generated with OWL2Bench [16] and two from the ORE 2015
repository [17]. Due to space constraints, we report results here only for the KB ore_ont_5596; results for the
other KBs can be found in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The ore_ont_5596 KB contains 58 class names, 33 role names, 322 GCIs,
112,320 individuals, 32,990 class assertions and 190,149 role assertions. Two base sets were considered:
ℳ₁ with all class names and ℳ₂ that adds ∃r.⊤ for each role r, yielding 93 concepts in total. We measure
the macro and micro recall on the query set Q = ℳ. The results are shown in Table 1.
      </p>
      <p>For each of ℳ₁ and ℳ₂, the first line represents the precision and recall of the empty 𝒯′, i.e., the quality
of query answering based only on the ABox. This serves as our baseline. We omit ε-approximations with
the uniform distribution, since they provide hardly any improvement over the baseline. The best macro
and micro recall values for each ℳ are shown in bold. The column |𝒯′| contains the number of axioms
learned.</p>
      <p>Overall, 𝒜-induced distributions yield substantially higher recall, particularly for micro recall. Lower
approximations typically have slightly reduced recall, but may be preferable when perfect precision
is required. Lower approximations for the uniform distribution do show some improvement over the
baseline, but usually smaller than those for the 𝒜-induced distribution and with larger sets of GCIs.
Note that perfect recall may not be achievable with a fixed base set ℳ, since 𝒯 may contain axioms
mixing concepts from inside and outside ℳ.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work is partly supported by DFG in project 389792660 (TRR 248, Center for Perspicuous Computing),
by BMBF in ScaDS.AI, and by BMBF and DAAD in project 57616814 (SECAI).</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT in order to: Improve writing style.
After using this tool, the authors reviewed and edited the content as needed and take full responsibility
for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] S. Obiedkov, B. Sertkaya, PAC learning of concept inclusions for ontology-mediated query answering, International Journal of Approximate Reasoning 186 (2025) 109523. URL: https://www.sciencedirect.com/science/article/pii/S0888613X25001641. doi:10.1016/j.ijar.2025.109523.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] M. Bienvenu, M. Ortiz, Ontology-mediated query answering with data-tractable description logics, in: W. Faber, A. Paschke (Eds.), Reasoning Web. Web Logic Rules - 11th International Summer School 2015, Berlin, Germany, July 31 - August 4, 2015, Tutorial Lectures, volume 9203 of Lecture Notes in Computer Science, Springer, 2015, pp. 218-307. doi:10.1007/978-3-319-21768-0_9.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] M. Bienvenu, Ontology-mediated query answering: Harnessing knowledge to get more from data, in: S. Kambhampati (Ed.), Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9-15 July 2016, IJCAI/AAAI Press, 2016, pp. 4058-4061. URL: http://www.ijcai.org/Abstract/16/600.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] F. Baader, B. Ganter, B. Sertkaya, U. Sattler, Completing description logic knowledge bases using formal concept analysis, in: M. M. Veloso (Ed.), Proceedings of the Twentieth International Joint Conference on Artificial Intelligence (IJCAI'07), AAAI Press, 2007, pp. 230-235.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] D. Borchmann, F. Distel, F. Kriegel, Axiomatisation of general concept inclusions from finite interpretations, Journal of Applied Non-Classical Logics 26 (2016) 1-46. doi:10.1080/11663081.2016.1168230.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] B. Konev, C. Lutz, A. Ozaki, F. Wolter, Exact learning of lightweight description logic ontologies, Journal of Machine Learning Research 18 (2018) 1-63. URL: http://jmlr.org/papers/v18/16-256.html.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] A. Ozaki, On the complexity of learning description logic ontologies, in: M. Manna, A. Pieris (Eds.), Reasoning Web. Declarative Artificial Intelligence: 16th International Summer School 2020, Oslo, Norway, June 24-26, 2020, Tutorial Lectures, Springer International Publishing, Cham, 2020, pp. 36-52. doi:10.1007/978-3-030-60067-9_2.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] A. Ozaki, C. Persia, A. Mazzullo, Learning query inseparable ℰℒℋ ontologies, Proceedings of the AAAI Conference on Artificial Intelligence 34 (2020) 2959-2966. doi:10.1609/aaai.v34i03.5688.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] M. Funk, J. C. Jung, C. Lutz, Actively learning concepts and conjunctive queries under ℰℒʳ ontologies, in: Z. Zhou (Ed.), Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI 2021.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>