=Paper=
{{Paper
|id=Vol-3739/abstract-20
|storemode=property
|title=Probably Approximately Correct Ontology Completion with pacco (Extended Abstract)
|pdfUrl=https://ceur-ws.org/Vol-3739/abstract-20.pdf
|volume=Vol-3739
|authors=Sergei Obiedkov,Barış Sertkaya
|dblpUrl=https://dblp.org/rec/conf/dlog/ObiedkovS24
}}
==Probably Approximately Correct Ontology Completion with pacco (Extended Abstract)==
Probably Approximately Correct Ontology Completion with pacco (Extended Abstract) Sergei Obiedkov1 , Barış Sertkaya2 1 Knowledge-Based Systems Group, TU Dresden, Dresden, Germany 2 Frankfurt University of Applied Sciences, Frankfurt, Germany Keywords Ontology learning, Angluin’s learning framework, PAC learning, 1. Introduction Traditionally, ontology construction is conducted by knowledge engineers whose task is to formalize relevant concepts of the application domain and the relationships among them. This process is tedious and error-prone, and thus various methods to facilitate construction and, to some extent, ensure completeness of the resulting ontology by querying a domain expert have been investigated. For instance, learnability of axioms in lightweight DLs from a given set of interpretations has been studied in [1], and exact learnability of lightweight DL ontologies in Angluin’s framework via queries has been studied in [2]. In [3, 4, 5, 6, 7], learning DL concepts and axioms from given examples has been investigated. Learning a DL ontology that is inseparable from a target ontology w.r.t. a query language and a fixed dataset via querying an oracle has been investigated in [8]. In a recent work [9], PAC learning of DL concepts has been studied. For an overview on learning DL ontologies see [10, 11]. In another line of work, Formal Concept Analysis (FCA) [12] was employed for mining an ℰℒ⊥ -basis of all axioms holding in a given model [13, 14]. In [15], a further approach for mining bases with a predefined and fixed role depth of concept expressions was proposed. In a more recent work [16], an approach for efficient mining of axioms from graph dataset was introduced and evaluated on large datasets. The FCA-based approaches have the disadvantage that, in the worst case, they issue exponentially many queries to the domain expert. In [17], we presented an approach combining the advantages of both lines of research for completing the missing information in the ontology w.r.t. some given set of concept descriptions via issuing only polynomially many (w.r.t. the relevant quantities) queries to an expert. In the present work, we improve and implement the approach from [17] and present a PAC version of the ontology-completion method, which was initially introduced in [18] and implemented in [19]. Our solution is based on an algorithm for PAC learning a Horn envelope of an arbitrary propositional formula [20], which is itself based on an algorithm for exactly learning Horn formulas [21]. Our setting is the following. Given: • an initial TBox 𝒯0 ; • an expert ℰ able to answer subsumption queries w.r.t. a TBox 𝒯ℰ that is unknown to us; • a set 𝒞 of concept descriptions built over the signature of 𝒯ℰ ; • a sampling oracle 𝒰 that, when called, returns a subsumption query over 𝒞 according to a probability distribution 𝒟𝒰 ; • a probability 𝛿 and error bound 𝜖 with 0 < 𝜖, 𝛿 < 1; compute a TBox 𝒯 such that 𝒯0 ⊆ 𝒯 and 𝒯 is, with probability at least 1 − 𝛿, an 𝜖-approximation of 𝒯ℰ . Building on our work [17], we introduce pacco1 , a tool for probably approximately correct (PAC) completion of DL TBoxes. DL 2024: 37th International Workshop on Description Logics, June 18–21, 2024, Bergen, Norway $ sergei.obiedkov@tu-dresden.de (S. Obiedkov); sertkaya@fb2.fra-uas.de (B. Sertkaya) 0000-0003-1497-4001 (S. Obiedkov); 0000-0002-4196-0150 (B. Sertkaya) © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). 1 https://github.com/sertkaya/pacco CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings Definition 1. If 𝒯 is a TBox and 𝒞 is a finite set of concept descriptions, we denote by Impl(𝒯 , 𝒞) = {𝑋 → 𝑌 | 𝑋, 𝑌 ⊆ 𝒞 and 𝒯 |= ⊓𝑋 ⊑ ⊓𝑌 } the set of implications corresponding to GCIs over conjunctions of concepts from 𝒞 entailed by 𝒯 . Our approach to solving the above problem w.r.t. 𝒯0 , 𝒞, and ℰ is to learn a set ℒ of implications that approximates Impl(𝒯 ℰ , 𝒞). For this, we need an oracle capable of answering membership and equivalence queries with respect to Impl(𝒯 ℰ , 𝒞). A membership query asks whether a certain subset of 𝒞 is a model of Impl(𝒯 ℰ , 𝒞), and an equivalence query asks for a subset of 𝒞 that is a model of either Impl(𝒯 ℰ , 𝒞) or ℒ but not both. We rely on a (domain) expert oracle ℰ answering subsumption queries over 𝒞 of the form 𝒯 ℰ |= ⊓𝑋 ⊑ 𝐶, where 𝑋 ⊆ 𝒞 and 𝐶 ∈ 𝒞. 2. Probably Approximately Correct Completion As a measure of approximation for TBoxes, we consider how often the two TBoxes give the same answers to subsumption queries: Definition 2. Let 𝒯1 and 𝒯2 be two TBoxes, 𝒞 be a finite set of concept descriptions, and 𝒟 be a probability distribution of subsumption queries over 𝒞. We define dist𝒟 𝒞 (𝒯1 , 𝒯2 ), the 𝒞-𝒟-distance between 𝒯1 and 𝒯2 , as the probability of getting different responses to a subsumption query w.r.t. 𝒯1 and 𝒯2 : Pr𝒟 (𝑄 | (𝒯1 |= 𝑄) ⇔ (𝒯2 ̸|= 𝑄)), where 𝑄 is a subsumption query over 𝒞. For a given 0 < 𝜖 < 1, we call a TBox 𝒯 an 𝜖-𝒞-𝒟-approximation of 𝒯 * if dist𝒟 𝒞 (𝒯 , 𝒯 ) ≤ 𝜖. If, in addition, * 𝒯 |= 𝒯 , i.e., the set of models of 𝒯 is a subset of the set of models of 𝒯 , then we say that 𝒯 is an * * upper 𝜖-𝒞-𝒟-approximation of 𝒯 * . Proposition 1. If 𝒯 is an upper 𝜖-𝒞-𝒟-approximation of 𝒯 * , then the 𝒞-𝒟-distance between 𝒯 and 𝒯 * is equal to Pr𝒟 (𝑄 | 𝒯 ̸|= 𝑄 and 𝒯 * |= 𝑄). Our algorithm has access to a sampling oracle 𝒰 simulating a user that, when called, returns a subsumption query over 𝒞 according to a probability distribution 𝒟𝒰 . Given parameters 0 < 𝜖, 𝛿 < 1, it uses oracles ℰ and 𝒰 to compute a TBox 𝒯 that, with probability at least 1 − 𝛿, is an upper 𝜖-𝒞-𝒟𝒰 - approximation of 𝒯 ℰ . The algorithm internally maintains a list of implications over 𝒞 such that, after termination, the corresponding set of GCIs is a desired approximation of 𝒯 ℰ . It starts with the empty list ℒ of implications and uses counterexamples obtained from (simulated) equivalence queries to update ℒ. In the algorithms from [21] and [20], both positive and negative counterexamples are possible. A positive counterexample may be returned if the current hypothesis contains an implication not entailed by the target Horn formula. In our setting, such an implication corresponds to a CGI not entailed by the target TBox 𝒯 ℰ . Since our goal is to obtain an upper approximation of 𝒯 ℰ , we do not allow such GCIs. Therefore, we modify the algorithm to make sure that ℒ always contains only implications corresponding to GCIs entailed by 𝒯 ℰ . This ensures that the resulting approximation of 𝒯 ℰ is an upper approximation. We simulate every equivalence query with several calls to the sampling oracle 𝒰. If not all valid GCIs are entailed by GCI(ℒ) = {⊓𝑋 ⊑ ⊓𝑌 | 𝑋 → 𝑌 ∈ ℒ}, we expect an equivalence query to return a subset 𝑋 of 𝒞 closed under ℒ but not under Impl(𝒯 ℰ , 𝒞). Since we aim at an 𝜖-approximation, we must be able to obtain, with an appropriate probability, such 𝑋 whenever dist𝒟 ℰ 𝒞 (GCI(ℒ), 𝒯 ) > 𝜖. Every iteration of the algorithm ⌈︁ starts with a⌉︁ search for a counterexample 𝑋 to GCI(ℒ). For the 𝑖th iteration, our algorithm makes log1−𝜖 𝑖(𝑖+1) 𝛿 attempts to generate a counterexample [22]. Since ℒ contains only valid implications, the counterexample, if found, is always negative and is a model of ℒ. The rest of the iteration eliminates some counterexample 𝑌 ⊆ 𝑋 by making sure that, for every 𝐶 ∈ 𝒞, the responses to the query ⊓𝑌 ⊑ 𝐶 w.r.t. GCI(ℒ) and 𝒯ℰ are identical. To do this, the algorithm searches ℒ for the first implication 𝑈 → 𝑉 such that 𝑌 = 𝑈 ∩ 𝑋 ̸= 𝑈 and 𝒯 ℰ |= ⊓𝑌 ⊑ 𝐶 for some 𝐶 ∈ 𝑉 ∖ 𝑌 . If such an implication is found, it is refined to 𝑌 → Compl(𝑌 ), where Compl(𝑌 ) is the completion of 𝑌 w.r.t. 𝒞 and 𝒯 ℰ , i.e., Compl(𝑌 ) = {𝐶 ∈ 𝒞 | 𝒯 ℰ |= ⊓𝑌 ⊑ 𝐶}. Otherwise, a new implication 𝑋 → Compl(𝑋) is added to the end of ℒ. In both cases, the new implication is valid w.r.t. Number of Execution 𝛿 𝜖 samples queries generated axioms unrecovered axioms time (sec.) 0.01 0.01 17502 18628 28 33 283 0.1 3467 6474 23 35 147 0.2 1690 4142 18 38 67 0.3 359 1226 6 46 10 0.1 0.01 14456 16172 27 32 254 0.1 3347 6431 22 34 132 0.2 1084 2719 10 43 44 0.2 745 2458 9 44 34 0.2 0.01 14965 16371 26 36 221 0.1 3338 6383 22 36 128 0.2 502 1533 6 47 16 0.3 209 773 3 48 5 0.3 0.01 14332 16039 26 34 226 0.1 3560 6923 22 34 140 0.2 1375 3735 15 39 56 0.3 19 88 1 49 <1 Table 1 Evalutation results completing GO-Plant. All values are rounded arithmetic means of 5 different runs. 𝒯 ℰ . Note that the computation of completion requires 𝑂(|𝒞|) queries to the oracle ℰ and no calls to the sampling oracle 𝒰. Combining this with the results in [20], we obtain Theorem 1. Given a domain expert oracle ℰ and a sampling oracle 𝒰, our algorithm computes, with probability at least 1 − 𝛿, an upper 𝜖-𝒞-𝒟-approximation of 𝒯 ℰ . The number of queries to ℰ and 𝒰 posed by the algorithm is polynomial in |𝒞|, 1/𝜖, 1/𝛿, and the minimal size of an implication set equivalent to Impl(𝒯 ℰ , 𝒞). 3. Experimental Results We evaluated our approach on a subset of the Gene Ontology (GO) [23], namely, GO-Plant2 , which contains 97 classes and 155 logical axioms, among them 145 subclass axioms between concept names. For our experiments, we randomly deleted 50 of these subclass axioms and completed the resulting ontology with pacco in various test settings using the uniform distribution of subsumption queries in the sampling oracle. As base set 𝒞, we took all the 97 concept names. As expert, we used the Hermit reasoner in conjunction with GO-Plant. As expected, increasing 𝜖 results in a smaller number of generated axioms and a larger number of unrecovered axioms, i.e., TBoxes that have larger 𝒞-𝒟-distance to the original TBox. The effect of varying 𝛿 is much smaller, since the number of sampling queries used to simulate an equivalence query depends linearly on 1/𝜖 and only logarithmically on 1/𝛿. We compared the resulting TBoxes with the original GO-Plant using the ontology diff tool ecco3 [24]. Ecco finds differences between ontologies and reports them in separate categories, one of which contains axioms from one ontology not entailed by the other. These are listed under unrecovered axioms in Table 1. The numbers remain relatively large even for small values of 𝜖. The reason is that 𝜖 corresponds to the target maximal distance between the entire expert and completed TBoxes rather than between what has been removed and what has been recovered. In future, we plan to explore the empirical behavior of our algorithm by simulating the expert ℰ and the sampling oracle 𝒰 from data with different distributions. For example, we may sample frequently occurring subsets of 𝒞 to learn GCIs with high support in the sense of association rule mining. It could also be interesting to sample infrequent subsets of 𝒞 and learn GCIs with low support. One could 2 https://geneontology.org/docs/download-ontology 3 https://github.com/rsgoncalves/ecco say that GCIs highly supported by data can be accepted without resorting to an expert, whereas, for low-support GCIs, it is important to get a confirmation. It may be worthwhile to develop a modification of the algorithm to approximately complete both a TBox and an ABox w.r.t. a specific interpretation. This may require a slightly different notion of approximation accounting for the information contained in the ABox to be completed. Acknowledgments This work is partly supported by DFG in project 389792660 (TRR 248, Center for Perspicuous Systems), by BMBF in ScaDS.AI, and by BMBF and DAAD in project 57616814 (SECAI). References [1] S. Klarman, K. Britz, Ontology learning from interpretations in lightweight description logics, in: K. Inoue, H. Ohwada, A. Yamamoto (Eds.), Inductive Logic Programming - 25th International Conference, ILP 2015, Kyoto, Japan, August 20-22, 2015, Revised Selected Papers, volume 9575 of Lecture Notes in Computer Science, Springer, 2015, pp. 76–90. [2] B. Konev, C. Lutz, A. Ozaki, F. Wolter, Exact learning of lightweight description logic ontologies, J. Mach. Learn. Res. 18 (2017) 201:1–201:63. [3] N. Fanizzi, C. d’Amato, F. Esposito, DL-FOIL concept learning in description logics, in: F. Zelezný, N. Lavrac (Eds.), Proceedings of Inductive Logic Programming, 18th International Conference, ILP, volume 5194 of Lecture Notes in Computer Science, Springer, 2008, pp. 107–121. [4] M. Funk, J. C. Jung, C. Lutz, H. Pulcini, F. Wolter, Learning description logic concepts: When can positive and negative examples be separated?, in: S. Kraus (Ed.), Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI, ijcai.org, 2019, pp. 1682–1688. [5] J. Lehmann, Dl-learner: Learning concepts in description logics, J. Mach. Learn. Res. 10 (2009) 2639–2642. [6] J. Lehmann, Learning OWL Class Expressions, volume 6 of Studies on the Semantic Web, IOS Press, 2010. [7] J. Lehmann, P. Hitzler, Concept learning in description logics using refinement operators, Mach. Learn. 78 (2010) 203–250. [8] A. Ozaki, C. Persia, A. Mazzullo, Learning query inseparable ℰℒℋ ontologies, in: The Thirty- Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, AAAI Press, 2020, pp. 2959–2966. [9] B. ten Cate, M. Funk, J. C. Jung, C. Lutz, Sat-based PAC learning of description logic concepts, in: Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI 2023, 19th-25th August 2023, Macao, SAR, China, ijcai.org, 2023, pp. 3347–3355. [10] A. Ozaki, Learning description logic ontologies: Five approaches. where do they stand?, Künstliche Intelligenz 34 (2020) 317–327. [11] A. Ozaki, On the complexity of learning description logic ontologies, in: M. Manna, A. Pieris (Eds.), Reasoning Web. Declarative Artificial Intelligence - 16th International Summer School 2020, Oslo, Norway, June 24-26, 2020, Tutorial Lectures, volume 12258 of Lecture Notes in Computer Science, Springer, 2020, pp. 36–52. [12] B. Ganter, R. Wille, Formal Concept Analysis: Mathematical Foundations, Springer-Verlag, Berlin, Germany, 1999. [13] F. Baader, F. Distel, Exploring finite models in the description logic ℰℒ𝑔𝑓 𝑝 , in: S. Ferré, S. Rudolph (Eds.), Proceedings of the 7th International Conference on Formal Concept Analysis, (ICFCA 2009), Springer-Verlag, 2009, pp. 146–161. [14] F. Distel, Learning description logic knowledge bases from data using methods from formal concept analysis, Ph.D. thesis, Dresden University of Technology, 2011. [15] D. Borchmann, F. Distel, F. Kriegel, Axiomatisation of general concept inclusions from finite interpretations, Journal of Applied Non-Classical Logics 26 (2016) 1–46. [16] F. Kriegel, Efficient axiomatization of OWL 2 EL ontologies from data by means of formal concept analysis, in: M. J. Wooldridge, J. G. Dy, S. Natarajan (Eds.), Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence, IAAI 2024, Fourteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2014, February 20-27, 2024, Vancouver, Canada, AAAI Press, 2024, pp. 10597–10606. [17] S. Obiedkov, B. Sertkaya, D. Zolotukhin, Probably approximately correct completion of description logic knowledge bases, in: M. Simkus, G. E. Weddell (Eds.), Proceedings of the 32nd International Workshop on Description Logics, Oslo, Norway, June 18-21, 2019, volume 2373 of CEUR Workshop Proceedings, CEUR-WS.org, 2019. [18] F. Baader, B. Ganter, U. Sattler, B. Sertkaya, Completing Description Logic Knowledge Bases using Formal Concept Analysis, LTCS-Report LTCS-06-02, Chair for Automata Theory, Insti- tute for Theoretical Computer Science, Dresden University of Technology, Germany, 2006. See http://lat.inf.tu-dresden.de/research/reports.html. [19] B. Sertkaya, Ontocomp system description, in: B. C. Grau, I. Horrocks, B. Motik, U. Sattler (Eds.), Proceedings of the 22nd International Workshop on Description Logics (DL 2009), Oxford, UK, July 27-30, 2009, volume 477 of CEUR Workshop Proceedings, CEUR-WS.org, 2009. [20] D. Borchmann, T. Hanika, S. Obiedkov, Probably approximately correct learning of horn envelopes from queries, Discrete Applied Mathematics 273 (2020) 30–42. Advances in Formal Concept Analysis: Traces of CLA 2016. [21] D. Angluin, M. Frazier, L. Pitt, Learning conjunctions of horn clauses, Machine Learning 9 (1992) 147–164. [22] R. Yarullin, S. Obiedkov, From equivalence queries to pac learning: The case of implication theories, International Journal of Approximate Reasoning 127 (2020) 1–16. [23] T. G. O. Consortium, Gene ontology: Tool for the unification of biology, Nature Genetics 25 (2000) 25–29. [24] R. S. Gonçalves, B. Parsia, U. Sattler, Ecco: A hybrid diff tool for OWL 2 ontologies, in: P. Klinov, M. Horridge (Eds.), Proceedings of OWL: Experiences and Directions Workshop 2012, Heraklion, Crete, Greece, May 27-28, 2012, volume 849 of CEUR Workshop Proceedings, CEUR-WS.org, 2012.