Introduction

Multiple Inheritance of Ontology Concepts in a Semantic-Aware Encoding Scheme

Weiqin Xu

0 1

Olivier Cure

Philippe Calvez

philippe.calvez1@engie.com 0 0 ENGIE CRIGEN CSAI Lab 1 LIGM (UMR 8049) , CNRS, UPEM, F-77454, Marne-la-Vallee

Litemat[1] is semantic-aware encoding scheme that was originally designed to support e cient inferences over the RDFS ontology language. Intuitively, it encodes, with integer values, the elements of the TBox, i.e., concepts and properties, in such a way that reasoning over these hierarchies can be reduced to some computation over identi er intervals. It has recently been extended to support owl:sameAs[4], owl:inverseOf and owl:transitiveProperty[2] by proposing a similar form of encoding for elements of the ABox, i.e., individuals. We can hence consider that Litemat now supports inferences in the ontology language frequently referred to as RDFS++. The main advantages provided by this encoding scheme is to considerably reduce the need for inference materialization and to provide an adapted query rewriting and optimization approach that improve the performance of query answering. Considering materialization, Litemat eliminates the need to store inferred facts since the semantics of attributed identi ers to concepts, properties and individuals are su cient for RDFS++ reasoning services. Moreover, Litemat's query rewriting approach replaces costly union of SPARQL basic graph patterns (BGP) with query lters over integer intervals. Litemat hence proposes a trade-o between the two most popular inference solutions encountered in RDF stores. These nice properties only come with an overhead of maintaining slightly larger dictionaries, due to larger identi er values and additional metadata, than the standard case of most RDF stores. In this paper, we present Litemat's support for multiple inheritance as well as the query processing and optimization that comes with it. 3 In this section, we present LiteMat's encoding scheme using the following ontology concept hierarchy (but this method can also be applied to property hierarchies): A v >, B v >, C v >, D v >, A1 v A, C1 v C, F 1 v F , E1 v E, E2 v E, F v B, F v C, E v A1, E v B and E v C.

Introduction LiteMat's encoding approach and query processing

We can see that the last 5 concept subsumptions describe some multiple inheritance, i.e., the concepts E (resp. F ) have 3 (resp. 2) super classes.

In Figure 1(a), we present LiteMat's encoding for this TBox. In a rst step, the assignment of an identi er, using a binary representation, for each concept is performed in a top-down recursive manner, i.e., it starts by setting the top concept (>) at 1, and proceeds level-wise on the element hierarchy until all leaves have been processed. Intuitively, for each concept , we count the number N of direct sub concepts (including ), e.g., in our running example > has 5 direct sub concepts. At this level, d log(N) e provides the number of bits necessary to represent each sub concept. Then, these sub concepts (excluding ) are pre xed by the binary identi er of and uniquely get a binary representation of a value 2 [1; N 1]. For instance, the concept A is pre xed with '1' (>'s identi er and is assigned the value 1 on 3 bits, i.e., '001', yielding the binary string '1001').

Finally, a normalization step makes sure that all identi ers are encoded on the same binary string length. This is performed as follows: once all concepts have been encoded in the rst step, we get the size L of the longest encoding string (i.e., 8 in our running example). Then all concept identi ers with an encoding length lower than L are appended with '0' until their length reaches L. This normalization step is represented with red '0' in Figure 1(a). The last column of this gure provides the integer identi er corresponding to each concept.

With this encoding scheme, it is clear that multiple inheritance poses a problem since each ontology concept must have a single identi er. For instance, in our running example, we can provide three di erent identi ers to E: one computed from A1, another one from B and a last one from C. In the next section, we propose a solution to this issue.

LiteMat proposes an e cient query processing approach that takes advantage of the semantic-aware encoding. In fact, whenever a query requires to reason over the concept or property hierarchies, the system simply introduces new variables that are ltered on the identi ers of our encoding scheme.

Consider the following BGP of a SPARQL query: f?x type Eg. It is generally rewritten into f?x type Eg UNION f?x type E1g UNION f?x type E2g. In LiteMat, the system would identify that E has several sub concepts, replace E with a new variable and add a FILTER clause that restricts the accepted values of this new variable. This restriction corresponds to an interval of integer values where the lower bound is the identi er of E and the upper bound is easily computing (i.e., using 2 bit shift operations and an addition) from the identi er of E. This computation requires an identi er metadata stating the index on the bit string where the normalization has started. Considering that the identi ers of E, E1 and E2 are resp. 156, 157 and 158 (to be explained in the next section), our LiteMat rewriting would be: f?x type ?y. FILTER (?y 156 && ?y<157)g. This rewriting also applies when sub property relationships are used and the more complex the query, the more e cient our rewriting.

Multiple inheritance support in LiteMat Encoding scheme

Our support of multiple inheritance is based on the notion of a representative. A representative Cr is selected among the super concepts C1,..,Cn of a concept SC. That is the integer identi er of SC will be computed following Litemat's approach based on the Cr identi er. It is obvious that with this approach SC loses any connection to its non-representative direct and indirect super concepts. Hence it is necessary to keep track of these super concepts. In the following, the remaining super classes of SC, i.e., fC1,..,Cngn Cr are considered as nonrepresentative of SC.

Let consider that the representative of the concept E is A1. Hence, the nonrepresentatives of E are the concepts B and C. In Figure 1(b)(where each concept has its identi er in subscript), a representative is pointed by a dashed arrow and we can observe that the identi ers of concepts E, E1 and E2 (resp. 156, 157 and 158) are computed from A1's identi er (i.e., 152).

To support an e cient query processing, we require a key/value data structure, denoted nonRep. Intuitively, each non representative of the ontology is an entry in that data structure and the value associated to a key corresponds to a set containing all the sub concepts involved in a multiple inheritance with the key concept as one of its super concept. In our running example, nonRep(B) = fEg and nonRep(C) = fE,Fg. 3.2

Query processing and optimization

The query processing presented in Section 2 is extended to produce complete and correct result sets for ontologies involving multiple inheritance. The extension coincides to the addition of disjunction in the generated FILTER clause of the rewritten SPARQL queries. Like in the original rewriting approach of Section 2, the disjuncts correspond to interval descriptions for a given query variable. An interval is computed using the nonRep data structure. Note that queries involving representatives do not necessarily involve this form of query processing. Due to space limitation, we present this processing and a simple optimization through the following example.

We consider the following BGP of a simple query, denoted Q, which involves a multiple inheritance: f?x rdf:type Cg. The query rewriting denoted Q0 corresponds to: f ?x rdf:type ?y. lter((?y 176 && ?y<192) k (?y 156 && ?y<160) k (?y 168 && ?y<176)) where the rst disjunct corresponds to the standard interval de ned in Section 2 and last two are computed using the nonRep data structures. That is, we search whether C is involved in a multiple inheritance by checking its presence as a key in nonRep. If it is the case, it will return a set of concepts and for each of these concepts a disjunct is added to the FILTER clause over that variable. In our running example, nonRep(C) returns a set with concepts E and F , resp. the identi ers 156 and 168. These values correspond to the lower bounds of the intervals and upper bounds are computer as stated in LiteMat.

Based on the intervals present in this FILTER clause, a simple optimization can be performed. It has the possible e ect of reducing the number of disjuncts in a query rewriting. The optimized queries are Q00 : f?x 0 ?y. lter((?y 168 && ?y<192) k (?y 156 && ?y<160)), i.e., it contains one less disjunct. 4

Conclusion

An encoding scheme, based on the semantic-aware approach of LiteMat, has been extended to support multiple inheritance. We emphasized that it can be used to e ciently to answer SPARQL queries. The overhead of the encoding scheme is relatively low and as been used in use case involving static and streaming data[ 3 ].

Olivier

Cure , Hubert Naacke, Tendry Randriamalala, and Bernd Amann. LiteMat: A scalable, cost-e cient inference encoding scheme for large RDF graphs . In 2015 IEEE International Conference on Big Data, Big Data 2015 , pages 1823 { 1830 , 2015 .

Olivier

Cure , Weiqin Xu,

Hubert

Naacke , and

Philippe

Calvez . Extending LiteMat toward RDFS++ . In Extended Semantic Web Conference, LASCAR workshop, 2019 .

Jeremy

Lhez , Xiangnan Ren, Badre Belabbess, and

Olivier

Cure . A compressed, inference-enabled encoding scheme for RDF stream processing . In The Semantic Web - 14th International Conference, ESWC 2017 , pages 79 { 93 , 2017 .

Xiangnan

Ren , Olivier Cure, Hubert Naacke, Jeremy Lhez,

and Li

Ke . Striderr: Massive and distributed RDF graph stream reasoning . In 2017 IEEE International Conference on Big Data, BigData 2017 , pages 3358 { 3367 , 2017 .