=Paper=
{{Paper
|id=Vol-3739/abstract-14
|storemode=property
|title=Efficient Axiomatization of OWL 2 EL Ontologies from Data by means of Formal Concept Analysis (Extended Abstract)
|pdfUrl=https://ceur-ws.org/Vol-3739/abstract-14.pdf
|volume=Vol-3739
|authors=Francesco Kriegel
|dblpUrl=https://dblp.org/rec/conf/dlog/Kriegel24
}}
==Efficient Axiomatization of OWL 2 EL Ontologies from Data by means of Formal Concept Analysis (Extended Abstract)==
Efficient Axiomatization of OWL 2 EL Ontologies from Data by means of Formal Concept Analysis Extended Abstract1 Francesco Kriegel Theoretical Computer Science, Technische UniversitΓ€t Dresden, Dresden, Germany Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI), Dresden/Leipzig, Germany Building and maintaining ontologies is a laborious task, especially for large domains, where knowledge engineers and domain experts work together to transfer their knowledge into an explicit representation. In Description Logic, the ABox of an ontology is usually filled with observed symbolic data but constructing the TBox is a more complex endeavor. Assistance by automated or interactive methods is often valuable. To this end, we reconsider the Formal-Concept-Analysis-based approach to completely axiomatizing β°ββ₯ concept inclusions πΆ β π· from graph data [3, 4] and 1. thoroughly revise and simplify its technical description including proofs, 2. equip it with support for already known concept inclusions satisfied in the data (thus enabling it for ontology completion), 3. analyze its computational complexity, 4. explain how further types of TBox statements supported by β°β++ that are not just syntactic sugar can be completely axiomatized, viz. role inclusions π1 β Β· Β· Β· β ππ β π and range restrictions β€ β βπ.πΆ, 5. describe how it can be implemented efficiently, 6. introduce variations that dispense with the computation of disjointness statements πΆ1 β Β· Β· Β· β πΆπ β β₯ or extremely large concept inclusions without practical relevance, thereby rendering the approach applicable in practice, albeit some completeness is lost, 7. and evaluate the implementation on real-world datasets. Formal Concept Analysis (FCA) [5] is a mathematical theory that represents data as formal contexts in which objects are described by their attributes. These attributes are similar to atomic statements in propositional logic and unary predicates in first-order logic. The canonical implication base is a complete set of implications, i.e. it entails all implications satisfied in the data [6, 7], and no complete set with fewer implications exists [8, 9]. In other words, the implication base axiomatizes data in form of a formal context by means of implications. We employ the description logic β°β++ [10], but we ignore nominals to avoid overfitting in the axiomatization method and concrete domains (datatypes for strings, numbers, etc.) as no β°β++ reasoner currently supports them. The Web Ontology Language includes β°β++ as the profile OWL 2 EL. By exploiting the similarity between concept inclusions and FCA implications, a complete TBox of concept inclusions can be axiomatized from observed data, i.e. which entails all concept inclusions satisfied in the data. More specifically, as input we expect graph data in form of an interpretation β, which includes knowledge graphs, graph databases, and RDF data: the concept names are the node labels and the role names are the edge labels. Preprocessing of a knowledge graph might be necessary, e.g. to correctly treat the metadata as well as to materialize the modelling conventions [11]. Further given may be an already existing TBox π― consisting of concept inclusions satisfied in β and relative to which β is to be 1 This is an extended abstract of an article [1] published in the proceedings of the 38th Annual AAAI Conference on Artificial Intelligence (AAAI-24) and of its extended version [2]. DL 2024: 37th International Workshop on Description Logics, June 18β21, 2024, Bergen, Norway $ francesco.kriegel@tu-dresden.de (F. Kriegel) 0000-0003-0219-0330 (F. Kriegel) Β© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings axiomatized, or rather, of which we compute a completion w.r.t. β. The axiomatization result is called a concept inclusion base or a completion in the following sense. [1, Definition 4]. A TBox is complete for β if it entails all concept inclusions satisfied in β. A concept inclusion base of β relative to π― is a TBox β¬ of which β is a model and such that β¬ βͺ π― is complete for β. We may also call β¬ a completion of π― w.r.t. β as we obtain a complete TBox by adding all concept inclusions in β¬ to π― . A canonical completion of π― w.r.t. β can be computed in exponential time [1, Theorem 10] (see below). This complexity result is tight since there are finite interpretations without any polynomial-size base.1 Moreover, if all concept inclusions in π― come in a particular normal form, then no base with fewer concept inclusions exists. In order to efficiently treat cycles in the input (both contained in β or induced by π― ), the concept inclusions are formulated in an extension of β°ββ₯ that allows for non-tree-shaped concept descriptions. To obtain a usual β°β++ TBox, the canonical concept inclusion base afterwards needs to be rewritten using variables. In order to compute the canonical completion, the given interpretation β is transformed into a formal context Kβ . Its finite attribute set M is specifically designed such thatd theredis a concept inclusion base involving only conjunctions over M, i.e. which consists of inclusions Cβ D for subsets C, D β M [1, Lemma 9]. This setting is perfectly suited for FCA since all necessary computations happen in the top-level conjunctions and there is no need to look inside the concept descriptions provided by M (which FCA could not do anyway). In particular, we first use an efficient FCA algorithm βοΈ [13, 14] βοΈ to compute the canonical base Can(Kβ ), which consists of FCA implications Cdβ D (i.e. d C β D in logical notation), and we then rewrite these implications into inclusions C β D to obtain a minimal concept inclusion base of β. Moreover, the formal context Kβ is axiomatized relative to the implication set ββ,π― . On the one hand, ββ,π― contains the implication {πΈ} β {πΉ } for each two concept descriptions πΈ, πΉ β M with πΈ ββ πΉ , viz. to avoid the axiomatization of tautological concept inclusions. On the other hand, the given TBox π― is taken into account by transforming all its inclusions into implications over M and adding these to ββ,π― . β₯ [1, Theorem 10]. For d each finite interpretation β and each β°βsi TBox π― of which β is a model, the TBox Can(β, π― ) := Can(Kβ , ββ,π― ) is a concept inclusion base of β relative to π― . It is called canonical concept inclusion base and can be computed in time that is exponential in Dom(β) and polynomial in π― . If all concept inclusions in π― have the form πΆ β π·[ββ] , then it contains the fewest inclusions among all concept inclusion bases of β relative to π― . Furthermore, there are finite interpretations that have no polynomial-size concept inclusion base. In addition to concept inclusions, β°β++ further supports role inclusions and range restrictions. The latter can easily be read off from the input: for each role name π, we first compute the most specific concept description πΆ that has all π-successors in β as instances and we then add the range restriction β€ β βπ.πΆ to the TBox. All role inclusions can be completely axiomatized by viewing β as a finite automaton. Specifically for objects π₯, π¦ in β we denote by Aπ₯,π¦ the automaton with initial state π₯ and final state π¦. Now a role inclusion π1 β Β· Β· Β· β ππ β π is not satisfied in β iff. there are objects π₯, π¦ in β with (π₯, π¦) β (π1 β Β· Β· Β· β ππ )β but (π₯, π¦) ΜΈβ π β . By definition, (π₯, π¦) β (π1 β Β· Β· Β· β ππ )β iff. the automaton Aπ₯,π¦ accepts the word π1 Β· Β· Β· ππ . So, the complement automaton of the union automaton of all Aπ₯,π¦ with (π₯, π¦) ΜΈβ π β accepts the word π1 Β· Β· Β· ππ iff. the role inclusion π1 β Β· Β· Β· β ππ β π is satisfied in β. Like for the concept inclusions, we use variables to formulate the axiomatized role inclusions: π β π β π for each transition (π, π, π), and π β π for each initial state, and π β π for each final state. It is decidable whether there are equivalent 1 Since each formal context can be seen as an interpretation without role names, this is an immediate corollary to a result in FCA: there is a sequence of formal contexts with 3Β·π objects and 2Β·π+1 attributes for which the number of implications in their canonical implication bases is exponential in π [12]. role inclusions without variables [15], but this seems unnecessary since most reasoners transform role inclusions into finite automata anyway. All in all, we obtain the following main result. [1, Theorem 13]. For each finite interpretation β, a complete TBox of β°ββ₯ concept inclusions, range restrictions, and role inclusions satisfied in β can be computed in exponential time. There are finite interpretations for which such a TBox cannot be of polynomial size. A technical limitation is that completeness does not go together with the syntactic restriction on the interplay of role inclusions and range restrictions in an β°β++ TBox π― that ensures tractable reasoning [10]: for each role inclusion π1 β Β· Β· Β· β ππ β π in π― where π β₯ 1, if π― does not entail the range restriction β€ β βππ .πΆ, then π― neither entails β€ β βπ .πΆ (i.e. new concept memberships for objects in the range of π are forbidden). This restriction might not be satisfied by a TBox that is complete for both role inclusions and range restrictions [2, Example XXIII]. To ensure that the TBox is within β°β++ , we could weaken the range restrictions: for each role name π , we obtain a suitable range restriction β€ β βπ .πΆ by computing the most specific concept description πΆ that has all π -successors in β as instances (as above) but also all π-successors for each role name π leading to a final state (since these represent role inclusions Β· Β· Β· β π β π ). Alternatively, we could remove all role inclusions that contribute to a violation of the syntactic restriction (by simply computing a language difference in the above automaton representation) and leave the range restrictions unchanged. To sum up, only two of the following goals can be achieved: the base satisfies the syntactic restriction, the base is complete for all range restrictions, the base is complete for all role inclusions. Regarding efficient implementation, it is important to reduce the input interpretation by grouping together all objects that cannot be distinguished by any concept description. Thereby the interpretation can be made significantly smaller and all subsequent steps need less time. However, in first experiments several computations did not finish due to extremely large concept descriptions used in the concept inclusion base to ensure completeness. We conjecture that such huge parts in the concept inclusion base do not have practical relevance or suffer from overfitting, and thus we added parameters (conjunction size limit and role depth bound) to dispense with the computation of these irrelevant huge parts but also to control the loss of completeness. Experiments further revealed that often more than half of the computation time is required for generating disjointness statements πΆ1 β Β· Β· Β· β πΆπ β β₯. We explain how intermediate computation steps can be stopped early in order to avoid computing them. We implemented the method in the programming language Scala and we evaluated the prototype with the plethora of ontologies from real-world applications used in the ORE 2015 Reasoner Competition. This collection is split into OWL 2 EL and OWL 2 DL ontologies. The former are all expressible in β°β++ . For the latter, we syntactically transform as many axioms as possible into β°β++ and remove the others. The goal then was to compute, for each such ontology, the completion of the TBox π― w.r.t. the interpretation β obtained by viewing the respective ABox under closed-world assumption. To ensure that β is a model of π― , we saturate β by means of the inclusions in π― if necessary. Altogether we obtained 614 test datasets with up to 747,998 objects, of which 446 (72.64 %) are acyclic. The average number of triples per object varies from slightly over 0 up to 25.39. Computation of concept inclusion bases finished for all reduced datasets with no more than 100 objects. For reduced datasets with up to 1,000 objects, the first errors due to insufficient computing resources occurred without restrictions. Between 1,000 and 10,000 objects, computations failed without restrictions, but otherwise succeeded in the majority of cases. Reduced datasets with more than 10,000 objects could only sometimes be axiomatized with very restricted settings, given 8 hours time and 80 GB memory on a twelve-year-old computer server. However, we did not implement the rewriting of the concept inclusion base into β°β++ , nor the axiomatization of role inclusions and range restrictions. That the theoretical approach itself can be extended to more expressive description logics has already been proven [16, 17], but it is unclear whether such an extended approach can still be efficiently implemented and used in practice. From the perspective of the underlying article [1], this seems possible for description logics characterized by simulations, e.g. β°ββ or Horn-πβπ. An interesting question for future research would be whether one can give any kind of completeness guarantee if a conjunction size limit is used, as already done for the role depth bound [18]. A smaller task would be to investigate how role inclusions and range restrictions can be integrated into the background knowledge after they have been computed but prior to axiomatizing the concept inclusions, preferably yielding an overall minimal base. Furthermore, the computation can be speed-up with even faster FCA algorithms for enumerating closures. The employed FCA algorithm [13, 14] is currently the fastest algorithm for computing the canonical implication base, but it is unfortunately only single-threaded. Developing a multi-threaded variant is thus another future goal. It might already help to change its depth-first behaviour. Apart from that one could use a faster programming language (like C++), more computation time, a faster computer server, or optimize the prototype. A concept inclusion πΆ β π· is confident if the ratio |(πΆ β π·)β |/|πΆ β | exceeds a pre-defined limit but need not be 100 %. A confident inclusion base extends the canonical inclusion base [19], and the prototype could be easily upgraded as it already computes the necessary pieces. We have not considered keys supported by the OWL 2 EL profile. Learning of keys from RDF data using FCA has been addressed [20β22]. To apply this approach to description logic and OWL it must be extended towards concept descriptions in place of concept names (RDF classes). Acknowledgments This work has been supported by Deutsche Forschungsgemeinschaft (DFG) in Projects 430150274 (Repairing Description Logic Ontologies) and 389792660 (TRR 248: Foundations of Perspicuous Software Systems), and has further been supported by the Saxon State Ministry for Science, Culture, and Tourism (SMWK) by funding the Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI). References [1] Francesco Kriegel. Efficient Axiomatization of OWL 2 EL Ontologies from Data by means of Formal Concept Analysis. In: Proc. of AAAI. 2024. doi: 10.1609/aaai.v38i9.28930. [2] Francesco Kriegel. Efficient Axiomatization of OWL 2 EL Ontologies from Data by means of Formal Concept Analysis (Extended Version). LTCS-Report 23-01. Technische UniversitΓ€t Dresden, 2023. doi: 10.25368/2023.214. See also the addendum: 10.5281/zenodo.10908141. [3] Franz Baader, Felix Distel. A Finite Basis for the Set of β°β-Implications Holding in a Finite Model. In: Proc. of ICFCA. 2008, pp. 46β61. doi: 10.1007/978-3-540-78137-0_4. [4] Franz Baader, Felix Distel. Exploring Finite Models in the Description Logic β°βgfp . In: Proc. of ICFCA. 2009, pp. 146β161. doi: 10.1007/978-3-642-01815-2_12. [5] Bernhard Ganter, Rudolf Wille. Formal Concept Analysis β Mathematical Foundations. 1999. doi: 10.1007/978-3-642-59830-2. [6] Jean-Luc Guigues, Vincent Duquenne. Famille minimale dβimplications informatives rΓ©sultant dβun tableau de donnΓ©es binaires. In: MathΓ©matiques et Sciences Humaines 95 (1986), pp. 5β18. url: http://www.numdam.org/item/MSH_1986__95__5_0.pdf. [7] Gerd Stumme. Attribute Exploration with Background Implications and Exceptions. In: 1996, pp. 457β469. doi: 10.1007/978-3-642-80098-6_39. [8] Marcel Wild. A Theory of Finite Closure Spaces Based on Implications. In: Advances in Mathe- matics 108.1 (1994), pp. 118β139. doi: 10.1006/aima.1994.1069. [9] Felix Distel. Learning description logic knowledge bases from data using methods from formal concept analysis. Doctoral thesis. Technische UniversitΓ€t Dresden, 2011. url: https : / / nbn - resolving.org/urn:nbn:de:bsz:14-qucosa-70199. [10] Franz Baader, Sebastian Brandt, Carsten Lutz. Pushing the β°β Envelope Further. In: Proc. of OWLED. 2008. url: http://ceur-ws.org/Vol-496/owled2008dc_paper_3.pdf. [11] Markus KrΓΆtzsch. Too Much Information: Can AI Cope with Modern Knowledge Graphs? In: Proc. of ICFCA. 2019, pp. 17β31. doi: 10.1007/978-3-030-21462-3_2. [12] Sergei O. Kuznetsov. On the Intractability of Computing the Duquenne-Guigues Base. In: J. Univers. Comput. Sci. 10.8 (2004), pp. 927β933. doi: 10.3217/JUCS-010-08-0927. [13] Radek JanoΕ‘tΓk, Jan KoneΔnΓ½, Petr KrajΔa. LinCbO: Fast algorithm for computation of the Duquenne-Guigues basis. In: Inf. Sci. 572 (2021), pp. 223β240. doi: 10.1016/j.ins.2021.04.104. [14] Radek JanoΕ‘tΓk, Jan KoneΔnΓ½, Petr KrajΔa. Pruning techniques in LinCbO for the computation of the Duquenne-Guigues basis. In: Inf. Sci. 616 (2022), pp. 182β203. doi: 10.1016/j.ins.2022.10.057. [15] Walter Bucher, Johann Hagauer. It is Decidable Whether a Regular Language is Pure Context-Free. In: Theor. Comput. Sci. 26 (1983), pp. 233β241. doi: 10.1016/0304-3975(83)90088-9. [16] Francesco Kriegel. Acquisition of Terminological Knowledge from Social Networks in Description Logic. In: Formal Concept Analysis of Social Networks. 2017, pp. 97β142. doi: 10.1007/978-3-319- 64167-6_5. [17] Francesco Kriegel. Joining Implications in Formal Contexts and Inductive Learning in a Horn Description Logic. In: Proc. of ICFCA. 2019, pp. 110β129. doi: 10.1007/978-3-030-21462-3_9. [18] Daniel Borchmann, Felix Distel, Francesco Kriegel. Axiomatisation of general concept inclusions from finite interpretations. In: J. Appl. Non Class. Logics 26.1 (2016), pp. 1β46. doi: 10.1080/ 11663081.2016.1168230. [19] Daniel Borchmann. Towards an Error-Tolerant Construction of β°ββ₯ -Ontologies from Data Using Formal Concept Analysis. In: Proc. of ICFCA. 2013, pp. 60β75. doi: 10.1007/978-3-642-38317-5_4. [20] Manuel Atencia, JΓ©rΓ΄me David, JΓ©rΓ΄me Euzenat, Amedeo Napoli, JΓ©rΓ©my Vizzini. Link key candidate extraction with relational concept analysis. In: Discret. Appl. Math. 273 (2020), pp. 2β20. doi: 10.1016/j.dam.2019.02.012. [21] Nacira Abbas, Alexandre Bazin, JΓ©rΓ΄me David, Amedeo Napoli. Non-Redundant Link Keys in RDF Data: Preliminary Steps. In: Proc. of FCA4AI. 2021, pp. 125β130. url: https://ceur-ws.org/Vol- 2972/paper12.pdf. [22] Nacira Abbas, Alexandre Bazin, JΓ©rΓ΄me David, Amedeo Napoli. A Study of the Discovery and Redundancy of Link Keys Between Two RDF Datasets Based on Partition Pattern Structures. In: Proc. of CLA. 2022, pp. 175β189. url: https://ceur-ws.org/Vol-3308/Paper14.pdf.