                         Probably Approximately Correct Ontology Completion
                         with pacco (Extended Abstract)
                         Sergei Obiedkov1 , Barış Sertkaya2
                         1
                             Knowledge-Based Systems Group, TU Dresden, Dresden, Germany
                         2
                             Frankfurt University of Applied Sciences, Frankfurt, Germany

                         Keywords
                         Ontology learning, Angluin’s learning framework, PAC learning




                         1. Introduction
                         Traditionally, ontology construction is conducted by knowledge engineers whose task is to formalize
                         relevant concepts of the application domain and the relationships among them. This process is tedious
                         and error-prone, and thus various methods to facilitate construction and, to some extent, ensure
                         completeness of the resulting ontology by querying a domain expert have been investigated. For
                         instance, learnability of axioms in lightweight DLs from a given set of interpretations has been studied
                         in [1], and exact learnability of lightweight DL ontologies in Angluin’s framework via queries has
                         been studied in [2]. In [3, 4, 5, 6, 7], learning DL concepts and axioms from given examples has been
                         investigated. Learning a DL ontology that is inseparable from a target ontology w.r.t. a query language
                         and a fixed dataset via querying an oracle has been investigated in [8]. In a recent work [9], PAC
                         learning of DL concepts has been studied. For an overview on learning DL ontologies see [10, 11]. In
                         another line of work, Formal Concept Analysis (FCA) [12] was employed for mining an ℰℒ⊥ -basis of all
                         axioms holding in a given model [13, 14]. In [15], a further approach for mining bases with a predefined
                         and fixed role depth of concept expressions was proposed. In a more recent work [16], an approach for
                         efficient mining of axioms from graph datasets was introduced and evaluated on large datasets.
                            The FCA-based approaches have the disadvantage that, in the worst case, they issue exponentially
                         many queries to the domain expert. In [17], we presented an approach combining the advantages of both
                         lines of research for completing the missing information in the ontology w.r.t. some given set of concept
                         descriptions via issuing only polynomially many (w.r.t. the relevant quantities) queries to an expert.
                         In the present work, we improve and implement the approach from [17] and present a PAC version
                         of the ontology-completion method, which was initially introduced in [18] and implemented in [19].
                         Our solution is based on an algorithm for PAC learning a Horn envelope of an arbitrary propositional
                         formula [20], which is itself based on an algorithm for exactly learning Horn formulas [21]. Our setting
                         is the following. Given:
                                • an initial TBox 𝒯0 ;
                                • an expert ℰ able to answer subsumption queries w.r.t. a TBox 𝒯ℰ that is unknown to us;
                                • a set 𝒞 of concept descriptions built over the signature of 𝒯ℰ ;
                                • a sampling oracle 𝒰 that, when called, returns a subsumption query over 𝒞 according to a
                                  probability distribution 𝒟𝒰 ;
                                • a probability 𝛿 and error bound 𝜖 with 0 < 𝜖, 𝛿 < 1;
                         compute a TBox 𝒯 such that 𝒯0 ⊆ 𝒯 and 𝒯 is, with probability at least 1 − 𝛿, an 𝜖-approximation of
                         𝒯ℰ . Building on our work [17], we introduce pacco1 , a tool for probably approximately correct (PAC)
                         completion of DL TBoxes.
                            DL 2024: 37th International Workshop on Description Logics, June 18–21, 2024, Bergen, Norway
                          Email: sergei.obiedkov@tu-dresden.de (S. Obiedkov); sertkaya@fb2.fra-uas.de (B. Sertkaya)
                          ORCID: 0000-0003-1497-4001 (S. Obiedkov); 0000-0002-4196-0150 (B. Sertkaya)
                                       © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                         1 https://github.com/sertkaya/pacco

Definition 1. If 𝒯 is a TBox and 𝒞 is a finite set of concept descriptions, we denote by Impl(𝒯 , 𝒞) =
{𝑋 → 𝑌 | 𝑋, 𝑌 ⊆ 𝒞 and 𝒯 |= ⊓𝑋 ⊑ ⊓𝑌 } the set of implications corresponding to GCIs over
conjunctions of concepts from 𝒞 entailed by 𝒯 .
   Our approach to solving the above problem w.r.t. 𝒯0, 𝒞, and ℰ is to learn a set ℒ of implications
that approximates Impl(𝒯ℰ, 𝒞). For this, we need an oracle capable of answering membership and
equivalence queries with respect to Impl(𝒯ℰ, 𝒞). A membership query asks whether a certain subset
of 𝒞 is a model of Impl(𝒯ℰ, 𝒞), and an equivalence query asks for a subset of 𝒞 that is a model of either
Impl(𝒯ℰ, 𝒞) or ℒ but not both. We rely on a (domain) expert oracle ℰ answering subsumption queries
over 𝒞 of the form 𝒯ℰ |= ⊓𝑋 ⊑ 𝐶, where 𝑋 ⊆ 𝒞 and 𝐶 ∈ 𝒞.
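Restricted to conjunctions over 𝒞, such GCIs behave like propositional Horn implications, so both query types can be prototyped with a simple forward-chaining closure. The following is a minimal, self-contained sketch using a toy propositional stand-in (not the actual pacco implementation; the example TBox is hypothetical):

```python
# Toy stand-in: a TBox over concept names is modeled as a list of
# implications (premise, conclusion); entailment of a GCI
# "conj(X) ⊑ conj(Y)" reduces to a forward-chaining closure test.

def closure(attrs, implications):
    """Smallest superset of attrs closed under all implications."""
    closed = set(attrs)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in implications:
            if premise <= closed and not conclusion <= closed:
                closed |= conclusion
                changed = True
    return frozenset(closed)

def entails(implications, xs, ys):
    """Subsumption query: do the implications entail conj(xs) ⊑ conj(ys)?"""
    return set(ys) <= closure(xs, implications)

def is_model(subset, implications):
    """Membership query: is `subset` a model of the implication set?"""
    return closure(subset, implications) == frozenset(subset)

# Hypothetical target TBox: Dog ⊑ Mammal, Mammal ⊑ Animal
target = [({"Dog"}, {"Mammal"}), ({"Mammal"}, {"Animal"})]
# entails(target, {"Dog"}, {"Animal"}) -> True
# is_model({"Dog"}, target)            -> False (not closed)
```

A membership query here is exactly the closedness test of the definition above: a subset of 𝒞 is a model iff applying the implications adds nothing new.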


2. Probably Approximately Correct Completion
As a measure of approximation for TBoxes, we consider how often the two TBoxes give the same
answers to subsumption queries:
Definition 2. Let 𝒯1 and 𝒯2 be two TBoxes, 𝒞 be a finite set of concept descriptions, and 𝒟 be a
probability distribution of subsumption queries over 𝒞. We define dist_𝒞^𝒟(𝒯1, 𝒯2), the 𝒞-𝒟-distance
between 𝒯1 and 𝒯2, as the probability of getting different responses to a subsumption query w.r.t. 𝒯1
and 𝒯2: Pr_𝒟(𝑄 | (𝒯1 |= 𝑄) ⇔ (𝒯2 ̸|= 𝑄)), where 𝑄 is a subsumption query over 𝒞. For a given
0 < 𝜖 < 1, we call a TBox 𝒯 an 𝜖-𝒞-𝒟-approximation of 𝒯* if dist_𝒞^𝒟(𝒯, 𝒯*) ≤ 𝜖. If, in addition,
𝒯* |= 𝒯, i.e., the set of models of 𝒯* is a subset of the set of models of 𝒯, then we say that 𝒯 is an
upper 𝜖-𝒞-𝒟-approximation of 𝒯*.
Proposition 1. If 𝒯 is an upper 𝜖-𝒞-𝒟-approximation of 𝒯*, then the 𝒞-𝒟-distance between 𝒯 and
𝒯* is equal to Pr_𝒟(𝑄 | 𝒯 ̸|= 𝑄 and 𝒯* |= 𝑄).
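For a small 𝒞 and the uniform distribution over queries of the form ⊓𝑋 ⊑ 𝐶, the distance of Definition 2 can be computed by exhaustive enumeration. Below is a self-contained toy sketch using a propositional Horn stand-in for TBoxes (the example implication sets are hypothetical):

```python
import itertools

def closure(attrs, implications):
    """Close a set of concept names under implications (premise, conclusion)."""
    closed = set(attrs)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in implications:
            if premise <= closed and not conclusion <= closed:
                closed |= conclusion
                changed = True
    return closed

def distance(t1, t2, concepts):
    """C-D-distance under the uniform distribution over queries
    conj(X) ⊑ C with X a subset of `concepts` and C a concept name."""
    names = sorted(concepts)
    total = diff = 0
    for r in range(len(names) + 1):
        for xs in itertools.combinations(names, r):
            for c in names:
                total += 1
                if (c in closure(xs, t1)) != (c in closure(xs, t2)):
                    diff += 1
    return diff / total

# Hypothetical target vs. the empty TBox over {A, B}: only the query
# conj({A}) ⊑ B gets different answers, so the distance is 1/8.
# distance([({"A"}, {"B"})], [], {"A", "B"}) -> 0.125
```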
   Our algorithm has access to a sampling oracle 𝒰 simulating a user that, when called, returns a
subsumption query over 𝒞 according to a probability distribution 𝒟𝒰 . Given parameters 0 < 𝜖, 𝛿 < 1,
it uses oracles ℰ and 𝒰 to compute a TBox 𝒯 that, with probability at least 1 − 𝛿, is an upper 𝜖-𝒞-𝒟𝒰 -
approximation of 𝒯 ℰ .
   The algorithm internally maintains a list of implications over 𝒞 such that, after termination, the
corresponding set of GCIs is a desired approximation of 𝒯 ℰ . It starts with the empty list ℒ of implications
and uses counterexamples obtained from (simulated) equivalence queries to update ℒ. In the algorithms
from [21] and [20], both positive and negative counterexamples are possible. A positive counterexample
may be returned if the current hypothesis contains an implication not entailed by the target Horn
formula. In our setting, such an implication corresponds to a GCI not entailed by the target TBox 𝒯ℰ.
Since our goal is to obtain an upper approximation of 𝒯 ℰ , we do not allow such GCIs. Therefore, we
modify the algorithm to make sure that ℒ always contains only implications corresponding to GCIs
entailed by 𝒯 ℰ . This ensures that the resulting approximation of 𝒯 ℰ is an upper approximation.
   We simulate every equivalence query with several calls to the sampling oracle 𝒰. If not all valid GCIs
are entailed by GCI(ℒ) = {⊓𝑋 ⊑ ⊓𝑌 | 𝑋 → 𝑌 ∈ ℒ}, we expect an equivalence query to return a
subset 𝑋 of 𝒞 closed under ℒ but not under Impl(𝒯ℰ, 𝒞). Since we aim at an 𝜖-approximation, we must
be able to obtain, with an appropriate probability, such an 𝑋 whenever dist_𝒞^𝒟(GCI(ℒ), 𝒯ℰ) > 𝜖.
   Every iteration of the algorithm starts with a search for a counterexample 𝑋 to GCI(ℒ). For the 𝑖th
iteration, our algorithm makes ⌈log_{1−𝜖}(𝛿/(𝑖(𝑖+1)))⌉ attempts to generate a counterexample [22]. Since ℒ
contains only valid implications, the counterexample, if found, is always negative and is a model of ℒ.
The rest of the iteration eliminates some counterexample 𝑌 ⊆ 𝑋 by making sure that, for every 𝐶 ∈ 𝒞,
the responses to the query ⊓𝑌 ⊑ 𝐶 w.r.t. GCI(ℒ) and 𝒯ℰ are identical. To do this, the algorithm
searches ℒ for the first implication 𝑈 → 𝑉 such that 𝑌 = 𝑈 ∩ 𝑋 ̸= 𝑈 and 𝒯 ℰ |= ⊓𝑌 ⊑ 𝐶 for some
𝐶 ∈ 𝑉 ∖ 𝑌 . If such an implication is found, it is refined to 𝑌 → Compl(𝑌 ), where Compl(𝑌 ) is the
completion of 𝑌 w.r.t. 𝒞 and 𝒯 ℰ , i.e., Compl(𝑌 ) = {𝐶 ∈ 𝒞 | 𝒯 ℰ |= ⊓𝑌 ⊑ 𝐶}. Otherwise, a new
implication 𝑋 → Compl(𝑋) is added to the end of ℒ. In both cases, the new implication is valid w.r.t.
      δ       ε      samples   queries   generated axioms   unrecovered axioms   execution time (sec.)
    0.01    0.01      17502     18628           28                  33                   283
            0.1        3467      6474           23                  35                   147
            0.2        1690      4142           18                  38                    67
            0.3         359      1226            6                  46                    10
    0.1     0.01      14456     16172           27                  32                   254
            0.1        3347      6431           22                  34                   132
            0.2        1084      2719           10                  43                    44
            0.3         745      2458            9                  44                    34
    0.2     0.01      14965     16371           26                  36                   221
            0.1        3338      6383           22                  36                   128
            0.2         502      1533            6                  47                    16
            0.3         209       773            3                  48                     5
    0.3     0.01      14332     16039           26                  34                   226
            0.1        3560      6923           22                  34                   140
            0.2        1375      3735           15                  39                    56
            0.3          19        88            1                  49                    <1
Table 1
Evaluation results for completing GO-Plant. All values are rounded arithmetic means of 5 different runs.


𝒯ℰ. Note that the computation of a completion requires 𝑂(|𝒞|) queries to the oracle ℰ and no calls to the
sampling oracle 𝒰. Combining this with the results in [20], we obtain
Theorem 1. Given a domain expert oracle ℰ and a sampling oracle 𝒰, our algorithm computes, with
probability at least 1 − 𝛿, an upper 𝜖-𝒞-𝒟-approximation of 𝒯 ℰ . The number of queries to ℰ and 𝒰 posed
by the algorithm is polynomial in |𝒞|, 1/𝜖, 1/𝛿, and the minimal size of an implication set equivalent to
Impl(𝒯 ℰ , 𝒞).
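The procedure behind Theorem 1 can be condensed into a short sketch: maintain a list ℒ of valid implications, search for negative counterexamples via sampling, and either refine an existing implication or append a new one, using the expert to compute completions. This is a simplified propositional stand-in under toy assumptions (the oracle interfaces are illustrative; pacco's actual implementation differs in details):

```python
import math

def closure(attrs, implications):
    """Close a set of concept names under implications (premise, conclusion)."""
    closed = set(attrs)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in implications:
            if premise <= closed and not conclusion <= closed:
                closed |= conclusion
                changed = True
    return frozenset(closed)

def pac_complete(concepts, expert_entails, sample_query, eps, delta,
                 max_iters=1000):
    """PAC completion sketch: learn implications valid w.r.t. the hidden
    TBox until sampling finds no more counterexamples."""
    def compl(ys):  # completion of ys w.r.t. the expert's TBox
        return frozenset(d for d in concepts if expert_entails(ys, d))

    L = []  # learned implications; invariant: all valid for the expert
    for i in range(1, max_iters + 1):
        attempts = math.ceil(math.log(delta / (i * (i + 1)), 1 - eps))
        cex = None
        for _ in range(attempts):       # simulate one equivalence query
            xs, c = sample_query()
            closed = closure(xs, L)
            if c not in closed and expert_entails(xs, c):
                cex = closed            # negative counterexample
                break
        if cex is None:
            return L                    # probably eps-close: stop
        for k, (u, v) in enumerate(L):  # refine the first suitable U -> V
            y = u & cex
            if y != u and any(d not in y and expert_entails(y, d)
                              for d in concepts):
                L[k] = (y, compl(y))
                break
        else:                           # otherwise append X -> Compl(X)
            L.append((cex, compl(cex)))
    return L
```

Both branches only ever add implications whose conclusions are computed by the expert, so every learned implication is valid w.r.t. the hidden TBox by construction, matching the upper-approximation guarantee.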


3. Experimental Results
We evaluated our approach on a subset of the Gene Ontology (GO) [23], namely, GO-Plant2 , which
contains 97 classes and 155 logical axioms, among them 145 subclass axioms between concept names.
For our experiments, we randomly deleted 50 of these subclass axioms and completed the resulting
ontology with pacco in various test settings using the uniform distribution of subsumption queries in
the sampling oracle. As the base set 𝒞, we took all 97 concept names. As the expert, we used the HermiT
reasoner in conjunction with GO-Plant.
   As expected, increasing 𝜖 results in a smaller number of generated axioms and a larger number of
unrecovered axioms, i.e., in completed TBoxes with a larger 𝒞-𝒟-distance to the original TBox. The effect
of varying 𝛿 is much smaller, since the number of sampling queries used to simulate an equivalence query
depends linearly on 1/𝜖 but only logarithmically on 1/𝛿.
   We compared the resulting TBoxes with the original GO-Plant using the ontology diff tool ecco3 [24].
Ecco finds differences between ontologies and reports them in separate categories, one of which contains
axioms from one ontology not entailed by the other. These are listed under unrecovered axioms in
Table 1. The numbers remain relatively large even for small values of 𝜖. The reason is that 𝜖 corresponds
to the target maximal distance between the entire expert and completed TBoxes rather than between
what has been removed and what has been recovered.
   In future work, we plan to explore the empirical behavior of our algorithm by simulating the expert ℰ and
the sampling oracle 𝒰 from data with different distributions. For example, we may sample frequently
occurring subsets of 𝒞 to learn GCIs with high support in the sense of association rule mining. It could
also be interesting to sample infrequent subsets of 𝒞 and learn GCIs with low support. One could
say that GCIs highly supported by data can be accepted without resorting to an expert, whereas, for
low-support GCIs, it is important to get a confirmation. It may also be worthwhile to develop a modification
of the algorithm that approximately completes both a TBox and an ABox w.r.t. a specific interpretation.
This may require a slightly different notion of approximation, accounting for the information contained
in the ABox to be completed.

2 https://geneontology.org/docs/download-ontology
3 https://github.com/rsgoncalves/ecco


Acknowledgments
This work is partly supported by DFG in project 389792660 (TRR 248, Center for Perspicuous Systems),
by BMBF in ScaDS.AI, and by BMBF and DAAD in project 57616814 (SECAI).


References
 [1] S. Klarman, K. Britz, Ontology learning from interpretations in lightweight description logics,
     in: K. Inoue, H. Ohwada, A. Yamamoto (Eds.), Inductive Logic Programming - 25th International
     Conference, ILP 2015, Kyoto, Japan, August 20-22, 2015, Revised Selected Papers, volume 9575 of
     Lecture Notes in Computer Science, Springer, 2015, pp. 76–90.
 [2] B. Konev, C. Lutz, A. Ozaki, F. Wolter, Exact learning of lightweight description logic ontologies, J.
     Mach. Learn. Res. 18 (2017) 201:1–201:63.
 [3] N. Fanizzi, C. d’Amato, F. Esposito, DL-FOIL concept learning in description logics, in: F. Zelezný,
     N. Lavrac (Eds.), Proceedings of Inductive Logic Programming, 18th International Conference, ILP,
     volume 5194 of Lecture Notes in Computer Science, Springer, 2008, pp. 107–121.
 [4] M. Funk, J. C. Jung, C. Lutz, H. Pulcini, F. Wolter, Learning description logic concepts: When can
     positive and negative examples be separated?, in: S. Kraus (Ed.), Proceedings of the Twenty-Eighth
     International Joint Conference on Artificial Intelligence, IJCAI, ijcai.org, 2019, pp. 1682–1688.
 [5] J. Lehmann, DL-Learner: Learning concepts in description logics, J. Mach. Learn. Res. 10 (2009)
     2639–2642.
 [6] J. Lehmann, Learning OWL Class Expressions, volume 6 of Studies on the Semantic Web, IOS Press,
     2010.
 [7] J. Lehmann, P. Hitzler, Concept learning in description logics using refinement operators, Mach.
     Learn. 78 (2010) 203–250.
 [8] A. Ozaki, C. Persia, A. Mazzullo, Learning query inseparable ℰℒℋ ontologies, in: The Thirty-
     Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative
     Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on
     Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12,
     2020, AAAI Press, 2020, pp. 2959–2966.
 [9] B. ten Cate, M. Funk, J. C. Jung, C. Lutz, SAT-based PAC learning of description logic concepts, in:
     Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI
     2023, 19th-25th August 2023, Macao, SAR, China, ijcai.org, 2023, pp. 3347–3355.
[10] A. Ozaki, Learning description logic ontologies: Five approaches. Where do they stand?, Künstliche
     Intelligenz 34 (2020) 317–327.
[11] A. Ozaki, On the complexity of learning description logic ontologies, in: M. Manna, A. Pieris
     (Eds.), Reasoning Web. Declarative Artificial Intelligence - 16th International Summer School 2020,
     Oslo, Norway, June 24-26, 2020, Tutorial Lectures, volume 12258 of Lecture Notes in Computer
     Science, Springer, 2020, pp. 36–52.
[12] B. Ganter, R. Wille, Formal Concept Analysis: Mathematical Foundations, Springer-Verlag, Berlin,
     Germany, 1999.
[13] F. Baader, F. Distel, Exploring finite models in the description logic ℰℒ𝑔𝑓 𝑝 , in: S. Ferré, S. Rudolph
     (Eds.), Proceedings of the 7th International Conference on Formal Concept Analysis, (ICFCA 2009),
     Springer-Verlag, 2009, pp. 146–161.
[14] F. Distel, Learning description logic knowledge bases from data using methods from formal concept
     analysis, Ph.D. thesis, Dresden University of Technology, 2011.
[15] D. Borchmann, F. Distel, F. Kriegel, Axiomatisation of general concept inclusions from finite
     interpretations, Journal of Applied Non-Classical Logics 26 (2016) 1–46.
[16] F. Kriegel, Efficient axiomatization of OWL 2 EL ontologies from data by means of formal concept
     analysis, in: M. J. Wooldridge, J. G. Dy, S. Natarajan (Eds.), Thirty-Eighth AAAI Conference on
     Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications of Artificial
     Intelligence, IAAI 2024, Fourteenth Symposium on Educational Advances in Artificial Intelligence,
     EAAI 2024, February 20-27, 2024, Vancouver, Canada, AAAI Press, 2024, pp. 10597–10606.
[17] S. Obiedkov, B. Sertkaya, D. Zolotukhin, Probably approximately correct completion of description
     logic knowledge bases, in: M. Simkus, G. E. Weddell (Eds.), Proceedings of the 32nd International
     Workshop on Description Logics, Oslo, Norway, June 18-21, 2019, volume 2373 of CEUR Workshop
     Proceedings, CEUR-WS.org, 2019.
[18] F. Baader, B. Ganter, U. Sattler, B. Sertkaya, Completing Description Logic Knowledge Bases
     using Formal Concept Analysis, LTCS-Report LTCS-06-02, Chair for Automata Theory, Insti-
     tute for Theoretical Computer Science, Dresden University of Technology, Germany, 2006. See
     http://lat.inf.tu-dresden.de/research/reports.html.
[19] B. Sertkaya, Ontocomp system description, in: B. C. Grau, I. Horrocks, B. Motik, U. Sattler (Eds.),
     Proceedings of the 22nd International Workshop on Description Logics (DL 2009), Oxford, UK,
     July 27-30, 2009, volume 477 of CEUR Workshop Proceedings, CEUR-WS.org, 2009.
[20] D. Borchmann, T. Hanika, S. Obiedkov, Probably approximately correct learning of Horn envelopes
     from queries, Discrete Applied Mathematics 273 (2020) 30–42. Advances in Formal Concept
     Analysis: Traces of CLA 2016.
[21] D. Angluin, M. Frazier, L. Pitt, Learning conjunctions of Horn clauses, Machine Learning 9 (1992)
     147–164.
[22] R. Yarullin, S. Obiedkov, From equivalence queries to PAC learning: The case of implication theories,
     International Journal of Approximate Reasoning 127 (2020) 1–16.
[23] The Gene Ontology Consortium, Gene Ontology: tool for the unification of biology, Nature Genetics 25 (2000)
     25–29.
[24] R. S. Gonçalves, B. Parsia, U. Sattler, Ecco: A hybrid diff tool for OWL 2 ontologies, in: P. Klinov,
     M. Horridge (Eds.), Proceedings of OWL: Experiences and Directions Workshop 2012, Heraklion,
     Crete, Greece, May 27-28, 2012, volume 849 of CEUR Workshop Proceedings, CEUR-WS.org, 2012.