Multiple Inheritance of Ontology Concepts in a Semantic-Aware Encoding Scheme Weiqin Xu1,2 , Olivier Curé1 , Philippe Calvez2 1 LIGM (UMR 8049), CNRS, UPEM, F-77454, Marne-la-Vallée, France.{firstname.lastname}@u-pem.fr 2 ENGIE CRIGEN CSAI Lab, philippe.calvez1@engie.com 1 Introduction Litemat[1] is semantic-aware encoding scheme that was originally designed to support efficient inferences over the RDFS ontology language. Intuitively, it en- codes, with integer values, the elements of the TBox, i.e., concepts and proper- ties, in such a way that reasoning over these hierarchies can be reduced to some computation over identifier intervals. It has recently been extended to support owl:sameAs[4], owl:inverseOf and owl:transitiveProperty[2] by proposing a similar form of encoding for elements of the ABox, i.e., individuals. We can hence consider that Litemat now supports inferences in the ontology language frequently referred to as RDFS++. The main advantages provided by this encoding scheme is to considerably reduce the need for inference materialization and to provide an adapted query rewriting and optimization approach that improve the performance of query answering. Considering materialization, Litemat eliminates the need to store inferred facts since the semantics of attributed identifiers to concepts, prop- erties and individuals are sufficient for RDFS++ reasoning services. Moreover, Litemat’s query rewriting approach replaces costly union of SPARQL basic graph patterns (BGP) with query filters over integer intervals. Litemat hence proposes a trade-off between the two most popular inference solutions encountered in RDF stores. These nice properties only come with an overhead of maintaining slightly larger dictionaries, due to larger identifier values and additional metadata, than the standard case of most RDF stores. In this paper, we present Litemat’s support for multiple inheritance as well as the query processing and optimization that comes with it. 3 2 LiteMat’s encoding approach and query processing In this section, we present LiteMat’s encoding scheme using the following ontol- ogy concept hierarchy (but this method can also be applied to property hierar- chies): A v >, B v >, C v >, D v >, A1 v A, C1 v C, F 1 v F , E1 v E, E2 v E, F v B, F v C, E v A1, E v B and E v C. 3 Copyright c 2019 for this paper by its authors. Use permitted under Creative Com- mons License Attribution 4.0 International (CC BY 4.0). We can see that the last 5 concept subsumptions describe some multiple inheritance, i.e., the concepts E (resp. F ) have 3 (resp. 2) super classes. In Figure 1(a), we present LiteMat’s encoding for this TBox. In a first step, the assignment of an identifier, using a binary representation, for each concept is performed in a top-down recursive manner, i.e., it starts by setting the top concept (>) at 1, and proceeds level-wise on the element hierarchy until all leaves have been processed. Intuitively, for each concept α, we count the number N of direct sub concepts (including α), e.g., in our running example > has 5 direct sub concepts. At this level, d log(N) e provides the number of bits necessary to represent each sub concept. Then, these sub concepts (excluding α) are prefixed by the binary identifier of α and uniquely get a binary representation of a value ∈ [1, N − 1]. For instance, the concept A is prefixed with ’1’ (>’s identifier and is assigned the value 1 on 3 bits, i.e., ’001’, yielding the binary string ’1001’). Finally, a normalization step makes sure that all identifiers are encoded on the same binary string length. This is performed as follows: once all concepts have been encoded in the first step, we get the size L of the longest encoding string (i.e., 8 in our running example). Then all concept identifiers with an encoding length lower than L are appended with ’0’ until their length reaches L. This normalization step is represented with red ’0’ in Figure 1(a). The last column of this figure provides the integer identifier corresponding to each concept. With this encoding scheme, it is clear that multiple inheritance poses a prob- lem since each ontology concept must have a single identifier. For instance, in our running example, we can provide three different identifiers to E: one computed from A1, another one from B and a last one from C. In the next section, we propose a solution to this issue. Fig. 1. LiteMat Encoding LiteMat proposes an efficient query processing approach that takes advantage of the semantic-aware encoding. In fact, whenever a query requires to reason over the concept or property hierarchies, the system simply introduces new variables that are filtered on the identifiers of our encoding scheme. Consider the following BGP of a SPARQL query: {?x type E}. It is generally rewritten into {?x type E} UNION {?x type E1} UNION {?x type E2}. In LiteMat, the system would identify that E has several sub concepts, replace E with a new variable and add a FILTER clause that restricts the accepted values of this new variable. This restriction corresponds to an interval of integer values where the lower bound is the identifier of E and the upper bound is easily computing (i.e., using 2 bit shift operations and an addition) from the identifier of E. This computation requires an identifier metadata stating the index on the bit string where the normalization has started. Considering that the identifiers of E, E1 and E2 are resp. 156, 157 and 158 (to be explained in the next section), our LiteMat rewriting would be: {?x type ?y. FILTER (?y≥156 && ?y<157)}. This rewriting also applies when sub property relationships are used and the more complex the query, the more efficient our rewriting. 3 Multiple inheritance support in LiteMat 3.1 Encoding scheme Our support of multiple inheritance is based on the notion of a representative. A representative Cr is selected among the super concepts C1 ,..,Cn of a concept SC. That is the integer identifier of SC will be computed following Litemat’s approach based on the Cr identifier. It is obvious that with this approach SC loses any connection to its non-representative direct and indirect super concepts. Hence it is necessary to keep track of these super concepts. In the following, the remaining super classes of SC, i.e., {C1 ,..,Cn }\ Cr are considered as non- representative of SC. Let consider that the representative of the concept E is A1. Hence, the non- representatives of E are the concepts B and C. In Figure 1(b)(where each concept has its identifier in subscript), a representative is pointed by a dashed arrow and we can observe that the identifiers of concepts E, E1 and E2 (resp. 156, 157 and 158) are computed from A1’s identifier (i.e., 152). To support an efficient query processing, we require a key/value data struc- ture, denoted nonRep. Intuitively, each non representative of the ontology is an entry in that data structure and the value associated to a key corresponds to a set containing all the sub concepts involved in a multiple inheritance with the key concept as one of its super concept. In our running example, nonRep(B) = {E} and nonRep(C) = {E,F}. 3.2 Query processing and optimization The query processing presented in Section 2 is extended to produce complete and correct result sets for ontologies involving multiple inheritance. The extension coincides to the addition of disjunction in the generated FILTER clause of the rewritten SPARQL queries. Like in the original rewriting approach of Section 2, the disjuncts correspond to interval descriptions for a given query variable. An interval is computed using the nonRep data structure. Note that queries involving representatives do not necessarily involve this form of query processing. Due to space limitation, we present this processing and a simple optimization through the following example. We consider the following BGP of a simple query, denoted Q, which involves a multiple inheritance: {?x rdf:type C}. The query rewriting denoted Q0 corre- sponds to: { ?x rdf:type ?y. filter((?y≥176 && ?y<192) k (?y≥156 && ?y<160) k (?y≥168 && ?y<176)) where the first disjunct corresponds to the standard interval defined in Section 2 and last two are computed using the nonRep data structures. That is, we search whether C is involved in a multiple inheritance by checking its presence as a key in nonRep. If it is the case, it will return a set of concepts and for each of these concepts a disjunct is added to the FILTER clause over that variable. In our running example, nonRep(C) returns a set with concepts E and F , resp. the identifiers 156 and 168. These values correspond to the lower bounds of the intervals and upper bounds are computer as stated in LiteMat. Based on the intervals present in this FILTER clause, a simple optimization can be performed. It has the possible effect of reducing the number of disjuncts in a query rewriting. The optimized queries are Q00 : {?x 0 ?y. filter((?y≥168 && ?y<192) k (?y≥156 && ?y<160)), i.e., it contains one less disjunct. 4 Conclusion An encoding scheme, based on the semantic-aware approach of LiteMat, has been extended to support multiple inheritance. We emphasized that it can be used to efficiently to answer SPARQL queries. The overhead of the encoding scheme is relatively low and as been used in use case involving static and streaming data[3]. References 1. Olivier Curé, Hubert Naacke, Tendry Randriamalala, and Bernd Amann. LiteMat: A scalable, cost-efficient inference encoding scheme for large RDF graphs. In 2015 IEEE International Conference on Big Data, Big Data 2015, pages 1823–1830, 2015. 2. Olivier Curé, Weiqin Xu, Hubert Naacke, and Philippe Calvez. Extending LiteMat toward RDFS++. In Extended Semantic Web Conference, LASCAR workshop, 2019. 3. Jérémy Lhez, Xiangnan Ren, Badre Belabbess, and Olivier Curé. A compressed, inference-enabled encoding scheme for RDF stream processing. In The Semantic Web - 14th International Conference, ESWC 2017, pages 79–93, 2017. 4. Xiangnan Ren, Olivier Curé, Hubert Naacke, Jérémy Lhez, and Li Ke. Striderr : Massive and distributed RDF graph stream reasoning. In 2017 IEEE International Conference on Big Data, BigData 2017, pages 3358–3367, 2017.