                         Efficient Axiomatization of OWL 2 EL Ontologies
                         from Data by means of Formal Concept Analysis
                         Extended Abstract1

                         Francesco Kriegel
                         Theoretical Computer Science, Technische UniversitΓ€t Dresden, Dresden, Germany
                         Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI), Dresden/Leipzig, Germany



                            Building and maintaining ontologies is a laborious task, especially for large domains, where knowledge
                         engineers and domain experts work together to transfer their knowledge into an explicit representation.
                         In Description Logic, the ABox of an ontology is usually filled with observed symbolic data, but
                         constructing the TBox is a more complex endeavor. Assistance by automated or interactive methods is
                         often valuable. To this end, we reconsider the Formal-Concept-Analysis-based approach to completely
                         axiomatizing β„°β„’βŠ₯ concept inclusions 𝐢 βŠ‘ 𝐷 from graph data [3, 4] and

                                1. thoroughly revise and simplify its technical description including proofs,
                                2. equip it with support for already known concept inclusions satisfied in the data (thus enabling it
                                   for ontology completion),
                                3. analyze its computational complexity,
                                4. explain how further types of TBox statements supported by β„°β„’++ that are not just syntactic
                                   sugar can be completely axiomatized, viz. role inclusions π‘Ÿ1 ∘ Β· Β· Β· ∘ π‘Ÿπ‘› βŠ‘ 𝑠 and range restrictions
                                   ⊀ βŠ‘ βˆ€π‘Ÿ.𝐢,
                                5. describe how it can be implemented efficiently,
                                6. introduce variations that dispense with the computation of disjointness statements
                                   𝐢1 βŠ“ Β· Β· Β· βŠ“ 𝐢𝑛 βŠ‘ βŠ₯ or of extremely large concept inclusions without practical relevance, thereby
                                   rendering the approach applicable in practice, although some completeness is lost,
                                7. and evaluate the implementation on real-world datasets.

                            Formal Concept Analysis (FCA) [5] is a mathematical theory that represents data as formal contexts
                         in which objects are described by their attributes. These attributes are similar to atomic statements
                         in propositional logic and unary predicates in first-order logic. The canonical implication base is a
                         complete set of implications, i.e. it entails all implications satisfied in the data [6, 7], and no complete
                         set with fewer implications exists [8, 9]. In other words, the implication base axiomatizes data given in
                         the form of a formal context by means of implications.
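
                            To make the FCA terminology concrete, the following Scala sketch (with hypothetical names, not
                         taken from the prototype) models a formal context together with its derivation operators and checks
                         whether an implication is satisfied in it.

// Minimal FCA sketch (hypothetical names, not the paper's implementation).
object FcaSketch {
  // A formal context: objects, attributes, and an incidence relation.
  final case class FormalContext[G, M](
      objects: Set[G],
      attributes: Set[M],
      incidence: Set[(G, M)]
  ) {
    // Attributes shared by all objects in `gs` (the derivation of `gs`).
    def intent(gs: Set[G]): Set[M] =
      attributes.filter(m => gs.forall(g => incidence((g, m))))
    // Objects having all attributes in `ms` (the derivation of `ms`).
    def extent(ms: Set[M]): Set[G] =
      objects.filter(g => ms.forall(m => incidence((g, m))))
    // An implication A -> B holds iff every object with all of A also has all of B.
    def satisfies(imp: Implication[M]): Boolean =
      extent(imp.premise).subsetOf(extent(imp.conclusion))
  }

  final case class Implication[M](premise: Set[M], conclusion: Set[M])

  def main(args: Array[String]): Unit = {
    val cxt = FormalContext(
      objects = Set("duck", "eagle"),
      attributes = Set("bird", "canFly", "canSwim"),
      incidence = Set("duck" -> "bird", "duck" -> "canFly", "duck" -> "canSwim",
                      "eagle" -> "bird", "eagle" -> "canFly")
    )
    println(cxt.satisfies(Implication(Set("bird"), Set("canFly"))))  // true
    println(cxt.satisfies(Implication(Set("bird"), Set("canSwim")))) // false
  }
}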
                            We employ the description logic β„°β„’++ [10], but we ignore nominals, to avoid overfitting in the
                         axiomatization method, as well as concrete domains (datatypes for strings, numbers, etc.), since no
                         β„°β„’++ reasoner currently supports them. The Web Ontology Language includes β„°β„’++ as the profile
                         OWL 2 EL. By exploiting the similarity between concept inclusions and FCA implications, a complete
                         TBox of concept inclusions can be axiomatized from observed data, i.e. one that entails all concept
                         inclusions satisfied in the data. More specifically, as input we expect graph data in the form of an
                         interpretation ℐ, which covers knowledge graphs, graph databases, and RDF data: the concept names
                         are the node labels and the role names are the edge labels. Preprocessing of a knowledge graph might
                         be necessary, e.g. to correctly treat the metadata as well as to materialize the modelling conventions [11].
                         Further, an already existing TBox 𝒯 may be given, consisting of concept inclusions satisfied in ℐ,
                         relative to which ℐ is to be
                         1
                          This is an extended abstract of an article [1] published in the proceedings of the 38th Annual AAAI Conference on Artificial
                          Intelligence (AAAI-24) and of its extended version [2].
                            DL 2024: 37th International Workshop on Description Logics, June 18–21, 2024, Bergen, Norway
                            francesco.kriegel@tu-dresden.de (F. Kriegel)
                            ORCID: 0000-0003-0219-0330 (F. Kriegel)
                                      Β© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


axiomatized, or rather, of which we compute a completion w.r.t. ℐ. The axiomatization result is called a
concept inclusion base or a completion in the following sense.

[1, Definition 4]. A TBox is complete for ℐ if it entails all concept inclusions satisfied in ℐ. A concept
inclusion base of ℐ relative to 𝒯 is a TBox ℬ of which ℐ is a model and such that ℬ βˆͺ 𝒯 is complete
for ℐ. We may also call ℬ a completion of 𝒯 w.r.t. ℐ as we obtain a complete TBox by adding all concept
inclusions in ℬ to 𝒯 .

   A canonical completion of 𝒯 w.r.t. ℐ can be computed in exponential time [1, Theorem 10] (see below).
This complexity result is tight since there are finite interpretations without any polynomial-size base.1
Moreover, if all concept inclusions in 𝒯 come in a particular normal form, then no base with fewer
concept inclusions exists. In order to efficiently treat cycles in the input (whether contained in ℐ or induced
by 𝒯 ), the concept inclusions are formulated in an extension of β„°β„’βŠ₯ that allows for non-tree-shaped
concept descriptions. To obtain a usual β„°β„’++ TBox, the canonical concept inclusion base afterwards
needs to be rewritten using variables.
   In order to compute the canonical completion, the given interpretation ℐ is transformed into a formal
context Kℐ . Its finite attribute set M is specifically designed such that there is a concept inclusion base
involving only conjunctions over M, i.e. one that consists of inclusions ⨅C βŠ‘ ⨅D for subsets C, D βŠ† M
[1, Lemma 9]. This setting is perfectly suited for FCA since all necessary computations happen in the
top-level conjunctions and there is no need to look inside the concept descriptions provided by M
(which FCA could not do anyway). In particular, we first use an efficient FCA algorithm [13, 14] to
compute the canonical base Can(Kℐ ), which consists of FCA implications C β†’ D (i.e. β‹€C β†’ β‹€D
in logical notation), and we then rewrite these implications into inclusions ⨅C βŠ‘ ⨅D to obtain a
minimal concept inclusion base of ℐ.
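
   The rewriting step can be illustrated as follows. The Scala sketch below (hypothetical names, not the
prototype's data model) represents the attributes in M as β„°β„’βŠ₯ concept descriptions and maps an FCA
implication C β†’ D to the inclusion ⨅C βŠ‘ ⨅D, where the empty conjunction stands for ⊀.

// Sketch of the rewriting from FCA implications to concept inclusions
// (hypothetical names; the actual implementation is more involved).
object RewritingSketch {
  // A small ADT for EL⊥ concept descriptions.
  sealed trait Concept
  case object Top extends Concept
  case object Bottom extends Concept
  final case class ConceptName(name: String) extends Concept
  final case class Existential(role: String, filler: Concept) extends Concept
  final case class Conjunction(conjuncts: Set[Concept]) extends Concept

  final case class ConceptInclusion(subsumee: Concept, subsumer: Concept)
  final case class Implication(premise: Set[Concept], conclusion: Set[Concept])

  // The conjunction over a subset of M; the empty conjunction is ⊀.
  def conjunctionOf(cs: Set[Concept]): Concept = cs.size match {
    case 0 => Top
    case 1 => cs.head
    case _ => Conjunction(cs)
  }

  // Rewrite an implication C -> D over M into the inclusion ⨅C βŠ‘ ⨅D.
  def toInclusion(imp: Implication): ConceptInclusion =
    ConceptInclusion(conjunctionOf(imp.premise), conjunctionOf(imp.conclusion))

  def main(args: Array[String]): Unit = {
    val imp = Implication(
      premise = Set(ConceptName("Person"), Existential("hasChild", Top)),
      conclusion = Set(ConceptName("Parent"))
    )
    println(toInclusion(imp))
  }
}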
   Moreover, the formal context Kℐ is axiomatized relative to the implication set ℒℐ,𝒯 . On the one
hand, ℒℐ,𝒯 contains the implication {𝐸} β†’ {𝐹 } for each two concept descriptions 𝐸, 𝐹 ∈ M with
𝐸 βŠ‘βˆ… 𝐹 , in order to avoid the axiomatization of tautological concept inclusions. On the other hand, the
given TBox 𝒯 is taken into account by transforming all its inclusions into implications over M and
adding these to ℒℐ,𝒯 .
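
   The first kind of background implications could be generated along the following lines. In this Scala
sketch (hypothetical names), attributes are simplified to conjunctions of concept names, for which
subsumption w.r.t. the empty TBox is just containment of conjunct sets; the translation of the inclusions
in 𝒯 into implications over M is omitted.

// Sketch of the tautology-blocking part of the background implications L_{I,T}
// (hypothetical names; only conjunctions of concept names are handled here,
// whereas M contains arbitrary concept descriptions).
object BackgroundImplicationsSketch {
  // Toy attribute: a conjunction of concept names, e.g. Set("Person", "Employee").
  type Attribute = Set[String]
  final case class Implication(premise: Set[Attribute], conclusion: Set[Attribute])

  // For conjunctions of concept names, E βŠ‘βˆ… F holds iff F's conjuncts are among E's.
  def subsumedByEmptyTBox(e: Attribute, f: Attribute): Boolean = f.subsetOf(e)

  // {E} -> {F} for all E, F in M with E βŠ‘βˆ… F, to block tautological inclusions.
  def tautologyImplications(m: Set[Attribute]): Set[Implication] =
    for {
      e <- m
      f <- m
      if e != f && subsumedByEmptyTBox(e, f)
    } yield Implication(Set(e), Set(f))

  def main(args: Array[String]): Unit = {
    val m: Set[Attribute] = Set(Set("Person"), Set("Person", "Employee"))
    tautologyImplications(m).foreach(println)
    // e.g. Implication(Set(Set(Person, Employee)),Set(Set(Person)))
  }
}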
[1, Theorem 10]. For each finite interpretation ℐ and each β„°β„’βŠ₯si TBox 𝒯 of which ℐ is a model, the
TBox Can(ℐ, 𝒯 ) := ⨅ Can(Kℐ , ℒℐ,𝒯 ) is a concept inclusion base of ℐ relative to 𝒯 . It is called the canonical
concept inclusion base and can be computed in time that is exponential in Dom(ℐ) and polynomial in 𝒯 .
If all concept inclusions in 𝒯 have the form 𝐢 βŠ‘ 𝐷ℐℐ , then it contains the fewest inclusions among all
concept inclusion bases of ℐ relative to 𝒯 . Furthermore, there are finite interpretations that have no
polynomial-size concept inclusion base.

   In addition to concept inclusions, β„°β„’++ further supports role inclusions and range restrictions. The
latter can easily be read off from the input: for each role name π‘Ÿ, we first compute the most specific
concept description 𝐢 that has all π‘Ÿ-successors in ℐ as instances and we then add the range restriction
⊀ βŠ‘ βˆ€π‘Ÿ.𝐢 to the TBox.
   All role inclusions can be completely axiomatized by viewing ℐ as a finite automaton. Specifically,
for objects π‘₯, 𝑦 in ℐ, we denote by Aπ‘₯,𝑦 the automaton with initial state π‘₯ and final state 𝑦. Now a role
inclusion π‘Ÿ1 ∘ Β· Β· Β· ∘ π‘Ÿπ‘› βŠ‘ 𝑠 is not satisfied in ℐ iff. there are objects π‘₯, 𝑦 in ℐ with (π‘₯, 𝑦) ∈ (π‘Ÿ1 ∘ Β· Β· Β· ∘ π‘Ÿπ‘› )ℐ
but (π‘₯, 𝑦) ̸∈ 𝑠ℐ . By definition, (π‘₯, 𝑦) ∈ (π‘Ÿ1 ∘ Β· Β· Β· ∘ π‘Ÿπ‘› )ℐ iff. the automaton Aπ‘₯,𝑦 accepts the word π‘Ÿ1 Β· Β· Β· π‘Ÿπ‘› .
So, the complement automaton of the union automaton of all Aπ‘₯,𝑦 with (π‘₯, 𝑦) ̸∈ 𝑠ℐ accepts the word
π‘Ÿ1 Β· Β· Β· π‘Ÿπ‘› iff. the role inclusion π‘Ÿ1 ∘ Β· Β· Β· ∘ π‘Ÿπ‘› βŠ‘ 𝑠 is satisfied in ℐ. Like for the concept inclusions, we
use variables to formulate the axiomatized role inclusions: 𝑝 ∘ π‘Ÿ βŠ‘ π‘ž for each transition (𝑝, π‘Ÿ, π‘ž), and
πœ€ βŠ‘ 𝑖 for each initial state, and 𝑓 βŠ‘ 𝑠 for each final state. It is decidable whether there are equivalent


1
    Since each formal context can be seen as an interpretation without role names, this is an immediate corollary to a result in
    FCA: there is a sequence of formal contexts with 3·𝑛 objects and 2·𝑛+1 attributes for which the number of implications in
    their canonical implication bases is exponential in 𝑛 [12].
role inclusions without variables [15], but this seems unnecessary since most reasoners transform role
inclusions into finite automata anyway. All in all, we obtain the following main result.

[1, Theorem 13]. For each finite interpretation ℐ, a complete TBox of β„°β„’βŠ₯ concept inclusions, range
restrictions, and role inclusions satisfied in ℐ can be computed in exponential time. There are finite
interpretations for which such a TBox cannot be of polynomial size.
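
   Returning to the role inclusions, the satisfaction condition stated above can also be checked directly
from its definition. The following Scala sketch (hypothetical names; the paper instead uses the automaton
construction to axiomatize all satisfied role inclusions at once) composes the role extensions along the
chain and tests containment in the extension of 𝑠.

// Direct check whether a role inclusion r1 ∘ Β· Β· Β· ∘ rn βŠ‘ s is satisfied in I
// (hypothetical names; not the automaton-based axiomatization from the paper).
object RoleInclusionSketch {
  type Obj = String
  type Relation = Set[(Obj, Obj)]

  // Relational composition: (x, z) ∈ r ∘ s iff (x, y) ∈ r and (y, z) ∈ s for some y.
  def compose(r: Relation, s: Relation): Relation =
    for {
      (x, y1) <- r
      (y2, z) <- s
      if y1 == y2
    } yield (x, z)

  // (r1 ∘ Β· Β· Β· ∘ rn)^I, computed left to right (role inclusions have n β‰₯ 1).
  def chainExtension(roleExt: Map[String, Relation], chain: List[String]): Relation =
    chain.map(roleExt.getOrElse(_, Set.empty[(Obj, Obj)])).reduce(compose)

  // The role inclusion is satisfied iff every pair connected by the chain is also an s-edge.
  def satisfied(roleExt: Map[String, Relation], chain: List[String], s: String): Boolean =
    chainExtension(roleExt, chain).subsetOf(roleExt.getOrElse(s, Set.empty))

  def main(args: Array[String]): Unit = {
    val roleExt: Map[String, Relation] = Map(
      "hasParent"      -> Set("c" -> "p"),
      "hasSibling"     -> Set("p" -> "u"),
      "hasUncleOrAunt" -> Set("c" -> "u")
    )
    println(satisfied(roleExt, List("hasParent", "hasSibling"), "hasUncleOrAunt")) // true
  }
}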

   A technical limitation is that completeness does not go together with the syntactic restriction on
the interplay of role inclusions and range restrictions in an β„°β„’++ TBox 𝒯 that ensures tractable
reasoning [10]: for each role inclusion π‘Ÿ1 ∘ Β· Β· Β· ∘ π‘Ÿπ‘› βŠ‘ 𝑠 in 𝒯 where 𝑛 β‰₯ 1, if 𝒯 does not entail the range
restriction ⊀ βŠ‘ βˆ€π‘Ÿπ‘› .𝐢, then 𝒯 neither entails ⊀ βŠ‘ βˆ€π‘ .𝐢 (i.e. new concept memberships for objects in
the range of 𝑠 are forbidden). This restriction might not be satisfied by a TBox that is complete for both
role inclusions and range restrictions [2, Example XXIII].
   To ensure that the TBox is within β„°β„’++ , we could weaken the range restrictions: for each role name 𝑠,
we obtain a suitable range restriction ⊀ βŠ‘ βˆ€π‘ .𝐢 by computing the most specific concept description 𝐢
that has all 𝑠-successors in ℐ as instances (as above) but also all π‘Ÿ-successors for each role name π‘Ÿ
leading to a final state (since these represent role inclusions Β· Β· Β· ∘ π‘Ÿ βŠ‘ 𝑠).
   Alternatively, we could remove all role inclusions that contribute to a violation of the syntactic
restriction (by simply computing a language difference in the above automaton representation) and
leave the range restrictions unchanged. To sum up, only two of the following three goals can be achieved at the same time:
the base satisfies the syntactic restriction, the base is complete for all range restrictions, the base is
complete for all role inclusions.
   Regarding efficient implementation, it is important to reduce the input interpretation by grouping
together all objects that cannot be distinguished by any concept description. Thereby the interpretation
can be made significantly smaller and all subsequent steps need less time. However, in initial experiments
several computations did not finish due to extremely large concept descriptions used in the concept
inclusion base to ensure completeness. We conjecture that such huge parts in the concept inclusion base
do not have practical relevance or suffer from overfitting, and thus we added parameters (conjunction
size limit and role depth bound) to dispense with the computation of these irrelevant huge parts but
also to control the loss of completeness. Experiments further revealed that often more than half of the
computation time is required for generating disjointness statements 𝐢1 βŠ“ Β· Β· Β· βŠ“ 𝐢𝑛 βŠ‘ βŠ₯. We explain
how intermediate computation steps can be stopped early in order to avoid computing them.
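
   As an illustration of the reduction step, the following Scala sketch (hypothetical names) groups objects
by a bisimulation-style partition refinement; bisimilar objects satisfy the same β„°β„’βŠ₯ concept descriptions,
so merging them is safe, although the indistinguishability relation described above may allow merging
even more objects.

// Partition-refinement sketch for grouping objects with the same behaviour
// (hypothetical names). Bisimilar objects satisfy the same EL⊥ concept
// descriptions, so merging them preserves all concept inclusions.
object ReductionSketch {
  type Obj = String
  final case class Interpretation(
      domain: Set[Obj],
      labels: Map[Obj, Set[String]],      // concept-name labels per object
      edges: Map[Obj, Set[(String, Obj)]] // outgoing (role, successor) edges
  )

  def blocks(i: Interpretation): Set[Set[Obj]] = {
    // Start by grouping objects with the same concept-name labels.
    var partition: Map[Obj, Int] =
      i.domain.groupBy(o => i.labels.getOrElse(o, Set.empty)).values.zipWithIndex
        .flatMap { case (block, idx) => block.map(_ -> idx) }.toMap
    // Refine: split blocks until every object in a block has, for each role,
    // successors in exactly the same blocks.
    var changed = true
    while (changed) {
      val signature: Obj => (Int, Set[(String, Int)]) = o =>
        (partition(o), i.edges.getOrElse(o, Set.empty).map { case (r, o2) => (r, partition(o2)) })
      val refined: Map[Obj, Int] =
        i.domain.groupBy(signature).values.zipWithIndex
          .flatMap { case (block, idx) => block.map(_ -> idx) }.toMap
      changed = refined.values.toSet.size != partition.values.toSet.size
      partition = refined
    }
    partition.toSeq.groupBy(_._2).values.map(_.map(_._1).toSet).toSet
  }

  def main(args: Array[String]): Unit = {
    val i = Interpretation(
      domain = Set("a", "b", "c"),
      labels = Map("a" -> Set("Person"), "b" -> Set("Person"), "c" -> Set("Person")),
      edges  = Map("a" -> Set(("knows", "b")), "b" -> Set(("knows", "a")))
    )
    println(blocks(i)) // "a" and "b" end up in one block, "c" in another
  }
}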
   We implemented the method in the programming language Scala and evaluated the prototype on the
large collection of ontologies from real-world applications used in the ORE 2015 Reasoner Competition.
This collection is split into OWL 2 EL and OWL 2 DL ontologies. The former are all expressible in
β„°β„’++ . For the latter, we syntactically transform as many axioms as possible into β„°β„’++ and remove the
others. The goal then was to compute, for each such ontology, the completion of the TBox 𝒯 w.r.t. the
interpretation ℐ obtained by viewing the respective ABox under closed-world assumption. To ensure
that ℐ is a model of 𝒯 , we saturate ℐ by means of the inclusions in 𝒯 if necessary. Altogether we
obtained 614 test datasets with up to 747,998 objects, of which 446 (72.64 %) are acyclic. The average
number of triples per object varies from slightly over 0 up to 25.39. Computation of concept inclusion
bases finished for all reduced datasets with no more than 100 objects. For reduced datasets with up
to 1,000 objects, the first errors due to insufficient computing resources occurred when no restrictions
were imposed. Between 1,000 and 10,000 objects, computations without restrictions failed, but restricted
ones succeeded in the majority of cases. Reduced datasets with more than 10,000 objects could only
sometimes be axiomatized with very restricted settings, given 8 hours of time and 80 GB of memory on a twelve-year-old
computer server. However, we did not implement the rewriting of the concept inclusion base into
β„°β„’++ , nor the axiomatization of role inclusions and range restrictions.
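
   For illustration, the closed-world view of an ABox as an interpretation mentioned above amounts to
a simple translation, sketched below in Scala (hypothetical names; only assertions with concept names
are covered, and the saturation w.r.t. 𝒯 is omitted).

// Sketch of viewing an ABox under the closed-world assumption as an
// interpretation: concept assertions become node labels, role assertions
// become labelled edges, and nothing else holds (hypothetical names).
object AboxToInterpretationSketch {
  type Obj = String
  sealed trait Assertion
  final case class ConceptAssertion(conceptName: String, individual: Obj) extends Assertion
  final case class RoleAssertion(roleName: String, subject: Obj, target: Obj) extends Assertion

  final case class Interpretation(
      domain: Set[Obj],
      conceptExtension: Map[String, Set[Obj]],
      roleExtension: Map[String, Set[(Obj, Obj)]]
  )

  def fromAbox(abox: Set[Assertion]): Interpretation = {
    val domain = abox.flatMap {
      case ConceptAssertion(_, a) => Set(a)
      case RoleAssertion(_, a, b) => Set(a, b)
    }
    val conceptExt = abox.collect { case ConceptAssertion(c, a) => (c, a) }.groupMap(_._1)(_._2)
    val roleExt = abox.collect { case RoleAssertion(r, a, b) => (r, (a, b)) }.groupMap(_._1)(_._2)
    Interpretation(domain, conceptExt, roleExt)
  }

  def main(args: Array[String]): Unit = {
    val abox: Set[Assertion] = Set(
      ConceptAssertion("Person", "alice"),
      RoleAssertion("hasChild", "alice", "bob")
    )
    println(fromAbox(abox))
  }
}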
   That the theoretical approach itself can be extended to more expressive description logics has already
been proven [16, 17], but it is unclear whether such an extended approach can still be efficiently
implemented and used in practice. From the perspective of the underlying article [1], this seems possible
for description logics characterized by simulations, e.g. ℰℒℐ or Horn-π’œβ„’π’ž.
   An interesting question for future research would be whether one can give any kind of completeness
guarantee if a conjunction size limit is used, as already done for the role depth bound [18]. A smaller
task would be to investigate how role inclusions and range restrictions can be integrated into the
background knowledge after they have been computed but prior to axiomatizing the concept inclusions,
preferably yielding an overall minimal base.
   Furthermore, the computation can be sped up with even faster FCA algorithms for enumerating
closures. The employed FCA algorithm [13, 14] is currently the fastest algorithm for computing the
canonical implication base, but it is unfortunately only single-threaded. Developing a multi-threaded
variant is thus another future goal. It might already help to change its depth-first behaviour. Apart
from that, one could use a faster programming language (such as C++), more computation time, or a
faster computer server, or one could further optimize the prototype.
   A concept inclusion 𝐢 βŠ‘ 𝐷 is confident if the ratio |(𝐢 βŠ“ 𝐷)ℐ |/|𝐢 ℐ | exceeds a pre-defined threshold,
which need not be 100 %. A confident inclusion base extends the canonical inclusion base [19], and the
prototype could be easily upgraded as it already computes the necessary pieces.
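
   Since (𝐢 βŠ“ 𝐷)ℐ = 𝐢 ℐ ∩ 𝐷ℐ , the confidence can be computed directly from the two extensions, as in
the following Scala sketch (hypothetical names).

// Confidence of a concept inclusion C βŠ‘ D in I, given the extensions of C and D
// (hypothetical names; (C βŠ“ D)^I equals the intersection of the two extensions).
object ConfidenceSketch {
  def confidence[Obj](extC: Set[Obj], extD: Set[Obj]): Double =
    if (extC.isEmpty) 1.0 // treat an empty premise extension as fully confident (a choice made for this sketch)
    else extC.intersect(extD).size.toDouble / extC.size

  def main(args: Array[String]): Unit = {
    val extC = Set("a", "b", "c", "d")
    val extD = Set("a", "b", "c", "x")
    println(confidence(extC, extD)) // 0.75, above a threshold of, say, 0.7
  }
}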
   We have not considered keys supported by the OWL 2 EL profile. Learning of keys from RDF data
using FCA has been addressed [20–22]. To apply this approach to description logic and OWL, it must be
extended towards concept descriptions in place of concept names (RDF classes).


Acknowledgments
This work has been supported by Deutsche Forschungsgemeinschaft (DFG) in Projects 430150274
(Repairing Description Logic Ontologies) and 389792660 (TRR 248: Foundations of Perspicuous Software
Systems), and has further been supported by the Saxon State Ministry for Science, Culture, and Tourism
(SMWK) by funding the Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI).


References
 [1] Francesco Kriegel. Efficient Axiomatization of OWL 2 EL Ontologies from Data by means of
     Formal Concept Analysis. In: Proc. of AAAI. 2024. doi: 10.1609/aaai.v38i9.28930.
 [2] Francesco Kriegel. Efficient Axiomatization of OWL 2 EL Ontologies from Data by means of Formal
     Concept Analysis (Extended Version). LTCS-Report 23-01. Technische UniversitΓ€t Dresden, 2023.
     doi: 10.25368/2023.214. See also the addendum: 10.5281/zenodo.10908141.
 [3] Franz Baader, Felix Distel. A Finite Basis for the Set of β„°β„’-Implications Holding in a Finite Model.
     In: Proc. of ICFCA. 2008, pp. 46–61. doi: 10.1007/978-3-540-78137-0_4.
 [4] Franz Baader, Felix Distel. Exploring Finite Models in the Description Logic β„°β„’gfp . In: Proc. of
     ICFCA. 2009, pp. 146–161. doi: 10.1007/978-3-642-01815-2_12.
 [5] Bernhard Ganter, Rudolf Wille. Formal Concept Analysis – Mathematical Foundations. 1999. doi:
     10.1007/978-3-642-59830-2.
 [6] Jean-Luc Guigues, Vincent Duquenne. Famille minimale d’implications informatives rΓ©sultant
     d’un tableau de donnΓ©es binaires. In: MathΓ©matiques et Sciences Humaines 95 (1986), pp. 5–18.
     url: http://www.numdam.org/item/MSH_1986__95__5_0.pdf.
 [7] Gerd Stumme. Attribute Exploration with Background Implications and Exceptions. In: 1996,
     pp. 457–469. doi: 10.1007/978-3-642-80098-6_39.
 [8] Marcel Wild. A Theory of Finite Closure Spaces Based on Implications. In: Advances in Mathe-
     matics 108.1 (1994), pp. 118–139. doi: 10.1006/aima.1994.1069.
 [9] Felix Distel. Learning description logic knowledge bases from data using methods from formal
     concept analysis. Doctoral thesis. Technische UniversitΓ€t Dresden, 2011. url: https://nbn-resolving.org/urn:nbn:de:bsz:14-qucosa-70199.
[10]   Franz Baader, Sebastian Brandt, Carsten Lutz. Pushing the β„°β„’ Envelope Further. In: Proc. of
       OWLED. 2008. url: http://ceur-ws.org/Vol-496/owled2008dc_paper_3.pdf.
[11]   Markus KrΓΆtzsch. Too Much Information: Can AI Cope with Modern Knowledge Graphs? In:
       Proc. of ICFCA. 2019, pp. 17–31. doi: 10.1007/978-3-030-21462-3_2.
[12]   Sergei O. Kuznetsov. On the Intractability of Computing the Duquenne-Guigues Base. In: J.
       Univers. Comput. Sci. 10.8 (2004), pp. 927–933. doi: 10.3217/JUCS-010-08-0927.
[13]   Radek Janoőtík, Jan Konečný, Petr Krajča. LinCbO: Fast algorithm for computation of the
       Duquenne-Guigues basis. In: Inf. Sci. 572 (2021), pp. 223–240. doi: 10.1016/j.ins.2021.04.104.
[14]   Radek Janoőtík, Jan Konečný, Petr Krajča. Pruning techniques in LinCbO for the computation of
       the Duquenne-Guigues basis. In: Inf. Sci. 616 (2022), pp. 182–203. doi: 10.1016/j.ins.2022.10.057.
[15]   Walter Bucher, Johann Hagauer. It is Decidable Whether a Regular Language is Pure Context-Free.
       In: Theor. Comput. Sci. 26 (1983), pp. 233–241. doi: 10.1016/0304-3975(83)90088-9.
[16]   Francesco Kriegel. Acquisition of Terminological Knowledge from Social Networks in Description
       Logic. In: Formal Concept Analysis of Social Networks. 2017, pp. 97–142. doi: 10.1007/978-3-319-
       64167-6_5.
[17]   Francesco Kriegel. Joining Implications in Formal Contexts and Inductive Learning in a Horn
       Description Logic. In: Proc. of ICFCA. 2019, pp. 110–129. doi: 10.1007/978-3-030-21462-3_9.
[18]   Daniel Borchmann, Felix Distel, Francesco Kriegel. Axiomatisation of general concept inclusions
       from finite interpretations. In: J. Appl. Non Class. Logics 26.1 (2016), pp. 1–46. doi: 10.1080/
       11663081.2016.1168230.
[19]   Daniel Borchmann. Towards an Error-Tolerant Construction of β„°β„’βŠ₯ -Ontologies from Data Using
       Formal Concept Analysis. In: Proc. of ICFCA. 2013, pp. 60–75. doi: 10.1007/978-3-642-38317-5_4.
[20]   Manuel Atencia, JΓ©rΓ΄me David, JΓ©rΓ΄me Euzenat, Amedeo Napoli, JΓ©rΓ©my Vizzini. Link key
       candidate extraction with relational concept analysis. In: Discret. Appl. Math. 273 (2020), pp. 2–20.
       doi: 10.1016/j.dam.2019.02.012.
[21]   Nacira Abbas, Alexandre Bazin, JΓ©rΓ΄me David, Amedeo Napoli. Non-Redundant Link Keys in
       RDF Data: Preliminary Steps. In: Proc. of FCA4AI. 2021, pp. 125–130. url: https://ceur-ws.org/Vol-
       2972/paper12.pdf.
[22]   Nacira Abbas, Alexandre Bazin, JΓ©rΓ΄me David, Amedeo Napoli. A Study of the Discovery and
       Redundancy of Link Keys Between Two RDF Datasets Based on Partition Pattern Structures. In:
       Proc. of CLA. 2022, pp. 175–189. url: https://ceur-ws.org/Vol-3308/Paper14.pdf.