Accessing Document Data Sources using Referring Expression Types Alexander Borgida1 , Enrico Franconi2 , David Toman3 and Grant Weddell3 1 Department of Computer Science, Rutgers University, New Brunswick, US 2 KRDB Research Centre for Knowledge and Data, Free University of Bozen-Bolzano, Italy 3 Cheriton School of Computer Science, University of Waterloo, Waterloo, Canada Abstract We show how JSON documents can be abstracted as concept descriptions in an appropriate description logic. This representation allows the use of additional background knowledge in the form of a TBox and an assignment of referring expression types (RETs) to certain primitive concepts to detect situations in which subdocuments, perhaps multiple subdocuments located in various parts of the original documents, capture information about a particular conceptual entity. Detecting such situations allows for normalizing the JSON document into several separate documents that capture all information about such conceptual entities in separate documents. This transformation preserves all the original information present in the input documents. The RET assignment contributes a set of possible concept descriptions that enable more refined and normalized capture of documents, and to more crafted answers to queries that adhere to user expectations expressed as RETs. We also show how RETs allow checking for a document admissibility condition ensuring that each document describes a single conceptual entity. 1. Introduction and Motivation Suppose we have a JSON/mongoDB document, and an ontology attaching semantics to (most) fields in JSON objects (called “keys” in the JSON definition; for the rest of this paper we reserve the word “key” for database-like keys). More concretely we treat the fields in JSON objects as (functional) roles in an underlying DL and use a TBox to add additional appropriate concept identifiers for the domain of the document. For example, for JSON document { "fname": "John", "lname": "Smith", "age": 25, "wife": { "fname" : "Mary", "lname": "Smith" } } the TBox might contain subsumptions stating, e.g., that: • PERSONs are objects that posses fname and lname fields, • fname and lname form a key for PERSONs, • a wife of a PERSON is also a PERSON, and so on. A formal way of capturing these constraints in our FunDL dialect is given in Example 3 below. DL 2022: 35th International Workshop on Description Logics, August 7–10, 2022, Haifa, Israel $ borgida@cs.rutgers.edu (A. Borgida); franconi@inf.unibz.it (E. Franconi); david@uwaterloo.ca (D. Toman); gweddell@uwaterloo.ca (G. Weddell) © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) We are interested in asking conjunctive queries over the document, with answers being returned in some “answer language” ℒ𝑎𝑛𝑠𝑤𝑒𝑟 . We could simply create distinct identifiers for each node in the JSON tree, and use roles (obtained from the fields of JSON objects) to connect them, resulting in an ABox, which in the above case might contain an assertion (id434,25) : age. In this case, a simple query like q(P) :- age(P,25) would return id434, which is quite meaningless without laboriously tracing back the creation of the ABox; and this answer makes no connection with the knowledge in the ontology. A more desirable answer would use the TBox, and return something like PERSON with fname=”John” and lname=”Smith”. This issue, and many others concerning the appropriate choice of description for objects in query answers was first addressed in [1], where the notion of “singular referring expression” was introduced. It stood for a concept description that was guaranteed to denote a single individual. Given the use of a referring expression in ℒ𝑎𝑛𝑠𝑤𝑒𝑟 to, among others, avoid meaningless object ids, another paper [2] suggested doing a similar thing in the ℒ𝑡𝑒𝑙𝑙 : replacing the ABox with a “CBox”, where singular referring expressions are used to state assertional facts as concepts the KB knows about. For example, the JSON fragment above could be captured by the concept ∃ fname.{"John"} ⊓ ∃ lname.{"Smith"} ⊓ ∃ age.{"25"} ⊓ ∃ wife.(∃ fname.{"Mary"} ⊓ ∃ lname.{"Smith"}) Here, nominal concepts play a central role, as do key constraints for singularity, and a general rule that CBox concepts cannot have empty interpretations. An algorithm was given in [2] for computing certain answers to conjunctive queries, which returns singular referring expressions appearing in the CBox. This work was carried out in a dialect of the FunDL family of description logics [3]. FunDL and its dialects replace roles with (possibly partial) unary functions called features, reify general roles, and always include a so-called path functional dependency (PFD) concept constructor to express constraints that are equality generating (including keys). In particular, the dialect used for these purposes in [2] has the property that DL reasoning and query answering are both polynomial-time (in the size of the underlying knowledge base). The above paper also gave, as an example of CBox use, one way to represent a JSON document (in particular, for a document in a MongoDB collection), as a singular concept description in FunDL. Example 1 (from [2]). Consider the case where PERSON is the name of a MongoDB JSON collection with a value as given in Figure 1. The PERSON document is captured by a CBox containing the following concept description expressed in terms of FunDL. ∃ collection.{"person"} ⊓ ∃ data.∃ dom− .∃ ran( ∃ fname.{"John"} ⊓ ∃ lname.{"Smith"} ⊓ ∃ age.{"25"} ⊓ ∃ wife.∃ fname.{"Mary"} ⊓ ∃ phone.∃ dom− .∃ ran(∃ colour.{"red"} ⊓ ∃ dnum.{"212 555-1234"})) ⊓ ∃ dom− .∃ ran( ∃ fname.{"Mary"} ⊓ ∃ lname.{"Jones"} ⊓ ∃ salary.{"$150000CAD"} ⊓ ∃ spouse.∃ fname.{"John"} ⊓ ∃ phone.∃ dom− .∃ ran(∃ loc.{"home"} ⊓ ∃ dnum.{"212 555-1234"}) ⊓ ∃ dom− .∃ ran(∃ loc.{"work"} ⊓ ∃ dnum.{"212 666-4567"})) □ { "collection": "person", "data" : [ { "fname": "John", "lname": "Smith", "age": 25, "wife": { "fname" : "Mary" }, "phone": [ {"colour": "red", "dnum": "212 555-1234"} ] } , { "fname": "Mary", "lname": "Jones", "salary": "$150,000 (CAD)", "spouse": { "fname": "John" }, "phone": [ {"loc": "home", "dnum": "212 555-1234"}, {"loc": "work", "dnum": "212 666-4567"} ] } ] } Figure 1: JSON PERSON Document. Intuitively, JSON values are mapped to concepts as follows: (a) primitive values to nominals, (b) (compound) objects to conjunctions of existential restrictions of features corresponding to the field names of the (mapping of) values, and (3) arrays (treated as sets) to multi-valued roles reified via the features dom and ran (see Definition 4). One can infer that a document would have the above-mentioned singularity property by asserting in a FunDL TBox that collection names (person in our case) are unique. Since in our case query answers are elements of the CBox, the above translation of a full JSON document into a single CBox entry is undesirable. For this reason, we propose to break it up into smaller conceptual entities that make sense based on the terminology in the TBox. This is achieved by using referring expression types (RETs) introduced in [1]. In particular, we will show how a combination of a TBox and an RET assignment to the primitive concepts occurring in the TBox enable mapping an initial CBox directly obtained from a MongoDB database, as illustrated above, to an alternative normalized CBox, as now illustrated. Example 2 (JSON normalization). Applying our normalization procedure to the CBox in Example 1 will then obtain the following concepts: DOCUMENT ⊓ ∃ collection.{"person"} ⊓ ∃ data(∃ dom− .∃ ran(PERSON ⊓ ∃ fname.{"John"} ⊓ ∃ lname.{"Smith"}) ⊓ ∃ dom− .∃ ran(PERSON ⊓ ∃ fname.{"Mary"} ⊓ ∃ lname.{"Jones"})) PERSON ⊓ ∃ fname.{"John"} ⊓ ∃ lname.{"Smith"} ⊓ ∃ age.{"25"} ⊓ ∃ wife.∃ fname{"Mary"} ⊓ ∃ phone.∃ dom− .∃ ran(PHONE ⊓ ∃ dnum{"212 555-1234"}) PERSON ⊓ ∃ fname.{"Mary"} ⊓ ∃ lname.{"Jones"} ⊓ ∃ salary.{"$150000CAD"} ⊓ ∃ spouse.∃ fname{"John"} ⊓ ∃ phone.∃ dom− .∃ ran(PHONE ⊓ ∃ dnum{"212 555-1234"}) ⊓ ∃ dom− .∃ ran(PHONE ⊓ ∃ dnum{"212 666-4567"})) PHONE ⊓ ∃ dnum{"212 555-1234"} ⊓ ∃ loc.{"home"} ⊓ ∃ colour.{"red"} PHONE ⊓ ∃ dnum{"212 555-4567"} ⊓ ∃ loc.{"work"} In the above, the underlined parts of these concepts serve as referring expressions identifying entities, while the remainder of the concept tells us facts about the entities, using (the dash-underlined) referring expressions when needed, to record such facts. For example, observe how references to phone entities in phone facts about persons require only dnum facts about phones. □ Our contributions are as follows. 1. We show how a JSON document (or a MongoDB collection) can be abstracted as a concept description, as illustrated in Example 1, and contrast this way of attaching a meaning or semantics with other proposals. 2. We show how an TBox can attach meaning to such concept descriptions (and therefore to the JSON documents). 3. We describe the normalization procedure which uses the given TBox and a referring expression type assignment in order to extract additional intuitively reasonable CBox subconcepts, as illustrated in Example 2. 4. We also show how information about the same entity can be consolidated even if it was originally recorded in different parts of the JSON document. 5. Finally, we present a more effective way of diagnosing an admissibility property of a CBox that ensures interpretations of referring expressions are indeed singular. All of the above tasks are accomplished by relying solely on reasoning about equalities and concept memberships of objects corresponding to values in the input JSON document with respect to the DL TBox. This sets our approach apart from many other approaches that commonly rely on mapping and transformation rules. The paper is organized as follows: Section 2 provides the needed background relating to FunDL and to referring expressions. Section 3 then outlines the main results of this paper relating to the ability to identify subdocuments relating to identifiable entities and subsequent separation of these entities in separate CBox entries/documents. We conclude with a brief overview of related work and with suggestions for follow-on research. 2. Definitions and Background We now formally define the artifacts introduced in our introductory comments, beginning with a general definition of concept descriptions for members of the FunDL family of DLs with PTIME complexity of logical consequence.1 Recall that members of this family replace roles with partial functions, and that concept descriptions not only occur in a TBox but also serve as referring expressions in a CBox. Definition 1 (FunDL Concepts, Referring Expressions and Knowledge Bases). Let F and PC be sets of feature names and primitive concept names, respectively. A path expression is defined by 1 Some additional conditions must hold on PFDs and on conjunctions; see [3, 4] for details. Syntax Semantics: Defn of “(·)ℐ ” 𝐶 ::= ⊥ ∅ (bottom) ⋀︀𝑘 | 𝐶 : Pf 1 , ..., Pf 𝑘 → Pf 0 {𝑥 | ∀𝑦.((𝑦 ∈ 𝐶 ℐ ∧ ( 𝑖=0 {𝑥, 𝑦} ⊆ (∃Pf 𝑖 .⊤)ℐ ) (PFD) ⋀︀𝑘 ( 𝑖=1 Pf ℐ𝑖 (𝑥) = Pf ℐ𝑖 (𝑦))) → (Pf ℐ0 (𝑥) = Pf ℐ0 (𝑦)))} | ⊤ △ℐ (top) | A Aℐ ⊆ △ℐ (primitive concept; A ∈ PC) | ∃Pf.𝐶 {𝑥 | ∃𝑦.(𝑦 ∈ 𝐶 ℐ ∧ Pf ℐ (𝑥) = 𝑦)} (value restriction) | 𝐶1 ⊓ 𝐶2 𝐶1ℐ ∩ 𝐶2ℐ (conjunction) | {𝑎} {𝑎ℐ } (nominal) | ∃𝑓 −1 .𝐶 {𝑓 ℐ (𝑥) | 𝑥 ∈ 𝐶 ℐ } (qualified inverse feature) Figure 2: Syntax and semantics of concept descriptions. the grammar “Pf ::= 𝑓. Pf | id ” for 𝑓 ∈ F. A concept description is defined by the grammar on the left-hand-side of Fig. 2.2 An inclusion dependency is an expression of the form 𝐶1 ⊑ 𝐶2 , where 𝐶𝑖 is parsed by the first six productions in Fig. 2. A terminology (TBox) 𝒯 consists of a finite set of inclusion dependencies. A referring expression is a concept description 𝐶 parsed by the last six productions in Fig. 2. A concept box (CBox) 𝒞 consists of a finite set of referring expressions. A knowledge base 𝒦 is a TBox/CBox pair (𝒯 , 𝒞). The semantics of concept descriptions and path expressions is defined with respect to a structure ℐ = (△ℐ , ·ℐ ), where △ℐ is a domain of “objects” and ·ℐ an interpretation function that fixes the interpretations of primitive concepts 𝐴 to be subsets of △ℐ and primitive features 𝑓 to be partial functions 𝑓 ℐ : △ℐ → △ℐ . The interpretation is extended to path expressions, id ℐ = 𝜆𝑥.𝑥, (𝑓. Pf)ℐ = Pf ℐ ∘𝑓 ℐ , in the natural way, and derived concept descriptions 𝐶 as defined in the centre column of Fig. 2. An interpretation ℐ satisfies an inclusion dependency 𝐶1 ⊑ 𝐶2 if 𝐶1ℐ ⊆ 𝐶2ℐ , and is a model of a TBox 𝒯 if it satisfies all inclusion dependencies in 𝒯 . ℐ is a model of a knowledge base 𝒦 = (𝒯 , 𝒞), written ℐ |= 𝒦, if it satisfies 𝒯 and also that |𝐶 ℐ | > 0 holds for every 𝐶 ∈ 𝒞. Given a TBox 𝒯 , a referring expression 𝐶 is singular with respect to 𝒯 if |𝐶 ℐ | ≤ 1 for all interpretations ℐ that are models of 𝒯 . The logical implication problem asks if 𝒦 |= 𝐶1 ⊑ 𝐶2 holds, that is, if 𝐶1 ⊑ 𝐶2 is satisfied in all models of 𝒦. □ Definition 2 (Admissibility and Query Answers). Let 𝒦 = (𝒯 , 𝒞) be a FunDL knowledge base and 𝑄 = {(𝑥1 , . . . , 𝑥𝑘 ) | 𝜙} a conjunctive query. The CBox 𝒞 is admissible for 𝒯 if each 𝐶 ∈ 𝒞 is a referring expression that is singular with respect to 𝒯 . (Thus, if 𝒦 is consistent and 𝒞 is admissible for 𝒯 , then |𝐶 ℐ | = 1 for any 𝐶 ∈ 𝒞 and any model ℐ for which ℐ |= 𝒦.) 2 A variety of equality generating dependencies, including keys, can be expressed with the use of a path functional dependency (PFD) concept description generated by the second production of this grammar. A k-tuple of referring expressions (𝐶1 , . . . , 𝐶𝑘 ) is a certain answer to 𝑄 in (𝒯 , 𝒞) if 𝒦 |= ∃𝑥1 , . . . , 𝑥𝑘 .(𝜙 ∧ 𝐶1 (𝑥1 ) ∧ . . . ∧ 𝐶𝑘 (𝑥𝑘 )) for {𝐶1 , . . . , 𝐶𝑘 } ⊆ 𝒞. □ The second artifact introduced in our introduction relates to so-called referring expression types (RETs) introduced in [1]. In this earlier work, such types were attached to the free variables of a conjunctive query, and denoted a space of well-formed formulae 𝜓 with one free variable, over a given FO signature consisting of unary and binary predicates, that were eligible referring expressions for the variable. Such types are essentially patterns of possible 𝜓, in our case, patterns of possible concept descriptions 𝐶, and are now attached to primitive concepts by a user defined referring expression type assignment (RTA). They determine a set of possible concept descriptions that are eligible referring expressions for subdocuments. We illustrate this below for our running example, but leave the presentation on how this is accomplished to Section 3. Definition 3 (Referring Expression Types and Assignments). A referring expression type is defined by the following grammar:3 𝑅𝑒 ::= A | ∃Pf.𝑅𝑒 | 𝑅𝑒 ⊓ 𝑅𝑒 | {?} | 𝑅𝑒 ; 𝑅𝑒 A referring expression type assignment (RTA) over a TBox 𝒯 is partial function mapping primitive concepts A occurring in 𝒯 to a referring expression type RTA(A). We define a language of referring expressions inhabiting 𝑅𝑒, ℒ(𝑅𝑒), as follows: ℒ(A) = {A} ℒ(∃Pf.𝑅𝑒) = {∃Pf.𝐶 | 𝐶 ∈ ℒ(𝑅𝑒)} ℒ(𝑅𝑒1 ⊓ 𝑅𝑒2 ) = {𝐶1 ⊓ 𝐶2 | 𝐶1 ∈ ℒ(𝑅𝑒1 ) and 𝐶2 ∈ ℒ(𝑅𝑒2 )} ℒ({?}) = {{𝑏} | 𝑏 is a constant symbol} ℒ(𝑅𝑒1 ; 𝑅𝑒2 ) = ℒ(𝑅𝑒1 ) ∪ ℒ(𝑅𝑒2 ) □ Example 3 (CBox normalization). Let CBox 𝒞 consist of the single referring expression in Exam- ple 1 obtained from the MongoDB collection given earlier, and let TBox 𝒯 consist of the following inclusion dependencies: (∃ collection.⊤) ⊓ (∃ data.⊤) ⊑ DOCUMENT (∃ fname.⊤) ⊓ (∃ lname.⊤) ⊑ PERSON ∃ dnum.⊤ ⊑ PHONE DOCUMENT ⊑ DOCUMENT : collection → id PERSON ⊑ PERSON : fname, lname → id PHONE ⊑ PHONE : dnum → id PERSON ⊑ ∃ wife.PERSON 3 This is a pattern language obtained by abstracting nominals in referring expressions and by admitting a final production to express preference among referring expressions [1]. Our normalization procedure presented in the next section, when given the knowledge base 𝒦 = (𝒯 , 𝒞) together with the following RTA assignment, will then replace 𝒞 in 𝒦 by the CBox in Example 2 of our introduction: RTA(DOCUMENT) = DOCUMENT ⊓ ∃ collection.{?} RTA(PERSON) = PERSON ⊓ ∃ lname.{?} ⊓ ∃ fname.{?} RTA(PHONE) = PHONE ⊓ ∃ dnum.{?} □ Mapping a JSON value to a referring expression in a CBox is straightforward with FunDL concepts. The following definition of ToConcept provides the details. (Recall that Example 1 in our introduction illustrates an invocation of ToConcept on a JSON collection.) Some obser- vations and reminders: (1) this mapping assumes any JSON value, including an array, will map to some element of an underlying domain; (2) the mapping relies entirely on interpreting field names in field-value pairs comprising JSON objects as feature names; and (3) arrays (treated as sets) are mapped to multi-valued roles reified via the features dom and ran. Definition 4 (ToConcept). An arbitrary JSON value is mapped to a CBox referring expression as follows: ToConcept("s") ↦→ {"s"} ToConcept(null) ↦→ ⊤ ToConcept({"k1 " : 𝑣1 , . . . , "kn " : 𝑣𝑛 }) ↦→ ∃ k1 . ToConcept(𝑣1 ) . . . ⊓ ∃ kn . ToConcept(𝑣𝑛 ) ToConcept([𝑣1 , . . . , 𝑣𝑚 ]) ↦→ ∃ dom− .∃ ran. ToConcept(𝑣1 ) ⊓ . . . ⊓ ∃ dom− .∃ ran. ToConcept(𝑣𝑚 ) where the first case covers all JSON values that are strings, numerics, and Booleans. □ 3. CBox Normalization Our CBox normalization procedure derives from a pair of normalization rules presented in Section 3.2. To enable references to sub-concepts in a concept description required by our formulation of these rules, we define the mapping ToAbox. Definition 5 (ToABox). An arbitrary referring expression 𝐶 is mapped to an ABox and a mapping from constant symbols 𝑎𝑖 in this ABox to nodes in the syntactic tree of 𝐶 as follows: ToAbox(𝑎 : {𝑏}) ↦→ {𝑎 = 𝑏} ToAbox(𝑎 : A) ↦→ {A(𝑎)}, A primitive ToAbox(𝑎 : ∃𝑓.𝐶) ↦→ {𝑓 (𝑎) = 𝑏} ∪ ToAbox(𝑏 : 𝐶), 𝑏 fresh ToAbox(𝑎 : ∃𝑓 −1 .𝐶) ↦→ ⋃︀ {𝑓 (𝑏) = 𝑎} ∪ ToAbox(𝑏 : 𝐶), 𝑏 fresh 𝑛 ToAbox(𝑎 : 𝐶1 ⊓ . . . ⊓ 𝐶𝑛 ) ↦→ 𝑖=1 {𝑎 = 𝑏𝑖 } ∪ ToAbox(𝑏𝑖 : 𝐶𝑖 ), 𝑏𝑖 fresh □ ⋃︀ We define ToAbox(𝒞) = 𝐶∈𝒞 ToAbox(𝑎 : 𝐶), 𝑎 fresh for each 𝐶. The ToAbox(𝐶) function converts an input concept 𝐶 to an ABox by traversing the syntactic structure of the concept, assigning distinct constant symbols 𝑎 to subconcepts of 𝐶, and then replacing these subconcepts by atomic ABox assertions. Definition 6 (Context). Let 𝐶 be a concept description. We use the notation 𝐶[𝑎′ : 𝐶 ′ ] to denote a subconcept 𝐶 ′ of the concept 𝐶 where 𝑎′ is the constant symbol assigned to 𝐶 ′ by ToAbox(𝑎 : 𝐶). For the top-level concept we simply use the context [𝑎 : 𝐶]. A projection function ToRE that uses a referring expression type 𝑅𝑒 to generate a referring expression 𝐶 in ℒ(𝑅𝑒) and holding for a particular individual 𝑎 in a knowledge base (or producing an undefined value if no such 𝐶 exists) is now defined. (Observe the use of an auxiliary recursive function with the same name in the definition that takes a path function Pf as a third argument.) Definition 7 (Projection on 𝑅𝑒). Let 𝒦 be a consistent knowledge base and 𝑎 a constant. We define a projection function ToRE(𝑎, 𝑅𝑒) for a referring expression 𝑅𝑒 as the result of the following recursive definition of ToRE(𝑎, 𝑅𝑒, id ) on the structure of 𝑅𝑒: ToRE(𝑎, A, Pf) = A if 𝒦 |= 𝑎 : ∃Pf.A, undefined otherwise ToRE(𝑎, {?}, Pf) = {𝑏} if 𝒦 |= 𝑎 : ∃Pf.{𝑏} for some 𝑏, undefined otherwise ToRE(𝑎, ∃Pf ′ .𝑅𝑒, Pf) = ∃Pf ′ .ToRE(𝑎, 𝑅𝑒, Pf . Pf ′ ) ToRE(𝑎, 𝑅𝑒1 ⊓ 𝑅𝑒2 , Pf) = ToRE(𝑎, 𝑅𝑒1 , Pf) ⊓ ToRE(𝑎, 𝑅𝑒2 , Pf) if both defined ToRE(𝑎, 𝑅𝑒1 ; 𝑅𝑒2 , Pf) = ToRE(𝑎, 𝑅𝑒1 , Pf) if defined, ToRE(𝑎, 𝑅𝑒2 , Pf) otherwise □ The following is a simple consequence of this and our previous definitions. Also, see earlier work in [5, 6] for effective ways of computing the second and fourth base cases. Lemma 1. For any constant 𝑎, any 𝑅𝑒, and any consistent 𝒦, ToRE(𝑎, 𝑅𝑒) ∈ ℒ(𝑅𝑒). It will also be useful to have a simplification procedure for referring concepts. The following definition of such a procedure will suffice for illustrative purposes, and clearly preserves concept equivalence. Definition 8 (Concept Simplification). We write Simplify(𝐶) to denote an exhaustive application of the following to referring expression 𝐶: 1. If an 𝑛-way conjunction contains ∃𝑓.𝐶1 and ∃𝑓.𝐶2 , replace both conjuncts by ∃𝑓.𝐶1 ⊓ 𝐶2 . 2. If an 𝑛-way conjunction contains duplicate conjuncts, remove one of the conjuncts. □ 3.1. CBox Admissibility In this subsection, we show how, given a TBox 𝒯 , one can statically test for admissibility of any CBox obtained by the normalization rules given in Subsection 3.2 that follows. This is achieved by a mapping to a sequence of logical consequence problems for inclusion dependencies expressing functional dependencies with PFDs that are induced by a given RTA assignment. We begin by defining a normalization of an 𝑅𝑒 that preserves ℒ(𝑅𝑒). Definition 9 (Normalized Types; from [1]). We write Norm(𝑅𝑒) to refer to an exhaustive application of the following rewrite rules to 𝑅𝑒: 𝑅𝑒 ⊓ (𝑅𝑒1 ; 𝑅𝑒2 ) ↦→ 𝑅𝑒 ⊓ 𝑅𝑒1 ; 𝑅𝑒 ⊓ 𝑅𝑒2 (𝑅𝑒1 ; 𝑅𝑒2 ) ⊓ 𝑅𝑒 ↦→ 𝑅𝑒1 ⊓ 𝑅𝑒1 ; 𝑅𝑒2 ⊓ 𝑅𝑒 ∃Pf.(𝑅𝑒1 ; 𝑅𝑒2 ) ↦→ ∃Pf.𝑅𝑒1 ; ∃Pf.𝑅𝑒2 □ The following are simple consequences: (1) ℒ(𝑅𝑒) = ℒ(Norm(𝑅𝑒)), and (2) all preference operators (“;”) are at the top level. We call the maximal “;”-free parts of Norm(𝑅𝑒𝑡) preference-free components. To statically test for singularity of referring expressions generated by the ToRE function for a particular referring expression type, we use the following auxiliary definitions: Pfs({?}) = {id } Con({?}) = ⊤ Pfs(A) = { } Con(A) = A Pfs(∃Pf ′ .𝑅𝑒) = {Pf ′ . Pf | Pf ∈ Pfs(𝑅𝑒)} Con(∃Pf ′ .𝑅𝑒) = ∃Pf ′ .Con(𝑅𝑒) Pfs(𝑅𝑒1 ⊓ 𝑅𝑒1 ) = Pfs(𝑅𝑒1 ) ∪ Pfs(𝑅𝑒2 ) Con(𝑅𝑒1 ⊓ 𝑅𝑒1 ) = Con(𝑅𝑒1 ) ⊓ Con(𝑅𝑒2 ) These functions extract a set of paths leading to nominals and a concept from a preference-free referring expression type. Altogether, we are now able to formulate the singularity test following the ideas presented in [1], Theorem 20: Theorem 1. Let 𝒯 be a TBox and 𝑅𝑒 a referring expression type. Then all referring expressions in ℒ(𝑅𝑒) are singular if 𝒯 |= Con(𝑅𝑒′ ) ⊑ Con(𝑅𝑒′ ) : Pfs(𝑅𝑒′ ) → id for every preference-free component 𝑅𝑒′ of Norm(𝑅𝑒). Our static test of admissibility of any CBox generated by our normalization rules then follows by applying the above to any 𝑅𝑒 in the range of a programmer supplied RTA. 3.2. CBox Normalization Rules We now have the necessary machinery to present our two rules for normalizing the CBox of a given knowledge base 𝒦 = (𝒯 , 𝒞) and referring expression type assignment RTA. Our first main rule extracts sub-concepts of a given CBox concept 𝐶 as an additional separate CBox concept. Definition 10 (Subdocument Extraction). Assume 𝐶 is a concept in 𝒞 which contains a subcon- cept 𝐶 ′ (i.e., 𝐶[𝑎 : 𝐶 ′ ]) corresponding to a JSON object. Also assume two properties hold: that (𝒯 , ToAbox(𝒞)) |= 𝐴(𝑎), and that RTA(𝐴) is defined for primitive concept 𝐴. We form a new CBox 𝒞 ′ as follows 𝒞 ′ := 𝒞 − {𝐶[𝑎 : 𝐶 ′ ]} ∪ {𝐶[𝑎 : ToRE(𝑎, RTA(𝐴))], ToRE(𝑎, RTA(𝐴)) ⊓ 𝐶 ′ } if ToRE(𝑎, RTA(𝐴)) is defined. If an entity 𝐴 cannot be properly identified, we report a warning. □ Intuitively, we replace a single monolithic concept 𝐶 in 𝒞 in which a subconcept 𝐶 ′ was identified as a representation of an 𝐴 entity by a modified variant of 𝐶 in which 𝐶 ′ has been replaced by its referring expression. In addition we create a new CBox concept ToRE(𝑎, 𝑅𝑒(𝐴)) ⊓ 𝐶 ′ for this entity. Note that the choice of 𝐴 above is non-deterministic, but does not affect the soundness of the extraction. However, this non-determinism can result in different referring expressions used to identify the same subdocument. Although beyond the scope this paper, in [7], we described a technique that allows one to diagnose that a given RTA has anticipated any such ambiguity, that any choice of 𝐴 must still lead to the same syntactic referring expression.4 We refer to such an RTA as identity resolving. Also note that, due to the properties of referring expressions (singularity in particular) we have (𝒯 , ToAbox(𝒞 ′ )) |= 𝑎 = 𝑏 for the above-introduced concepts 𝐶[𝑎 : ToRE(𝑎, RTA(𝐴))] and [𝑏 : ToRE(𝑎, RTA(𝐴)) ⊓ 𝐶 ′ ] in 𝒞 ′ . Hence, the models of (𝒯 , 𝒞) and (𝒯 , 𝒞 ′ ) will coincide. The second of our two rules replaces two CBox referring expressions with a single referring expression when co-reference is implied. Note that, were the given RTA identity resolving, we would always have 𝐷 = 𝐷′ . Definition 11 (Equivalent Subdocument Merge). Let [𝑎 : 𝐷 ⊓ 𝐶] and [𝑏 : 𝐷′ ⊓ 𝐶 ′ ] be two concepts in 𝒞 such that (𝒯 , ToAbox(𝒞)) |= 𝑎 = 𝑏. We replace 𝒞 with 𝒞 ′ := 𝒞 − {[𝑎 : 𝐷 ⊓ 𝐶], [𝑏 : 𝐷′ ⊓ 𝐶 ′ ]} ∪ {[𝑎 : 𝐷 ⊓ Simplify(𝐶 ⊓ 𝐶 ′ )}. □ Our main results now follow. Theorem 2. Let 𝒞 be a CBox in a consistent knowledge base 𝒦 = (𝒯 , 𝒞) and 𝒞 ′ a CBox obtained by applying the Subdocument Extraction or the Equivalent Subdocument Merge rules. Then every model of 𝒦 is also a model of (𝒯 , 𝒞 ′ ) and vice versa. Strictly speaking, this is not true for (𝒯 , ToAbox(𝒞)) and (𝒯 , ToAbox(𝒞 ′ )), but the models of these two knowledge bases will only differ in what constants are assigned to what subconcepts in 𝒞 and 𝒞 ′ , respectively. Theorem 3. Let 𝒞 be an admissible CBox in a consistent knowledge base 𝒦 = (𝒯 , 𝒞) and 𝒞 ′ a CBox obtained by applying the Subdocument Extraction or the Equivalent Subdocument Merge rules. Then 𝒞 ′ is admissible. 4. Summary Comments The main contribution of this paper is showing how JSON-like data sources can be abstracted as concept descriptions in an appropriate DL in a very generic way. This enables their domain- specific semantics to be naturally captured as a TBox in the same logic, which then allows one to draw on mature reasoning services that have been developed for DLs [8, 4]. In addition, our approach utilizes referring expressions [1] as the means of identifying entities described in such data sources. This is crucial to the proposal’s ability to detect multiple subdocuments that provide information about the same entity. Notably, this is achieved completely automatically since the appropriate equalities will be entailed by the knowledge base consisting of a domain- specific TBox and the data sources captured as concepts in a CBox. Our primary technical contribution is our ability to separate entities into separate documents and to consolidate documents that provide information about the same entity, also by appeal to referring expressions and to entailment in the underlying DL. 4 Essentially, this entails ensuring that preference in RTA(𝐴), for any primitive concept 𝐴, exhaustively accounts for any primitive concept 𝐵 for which there exists some interpretation ℐ for which (𝐴 ⊓ 𝐵)ℐ is non-empty. 4.1. Related Work There are many papers that take as input a semi-structured document, in JSON or XML (some- times with a schema) plus an ontology in a DL TBox, and create individual instance descriptions in an ABox; for a survey see [9]. Usually, this is intended for just adding semantics to the document, but reasoning could be used to detect inconsistencies, such as a situation where two properties with disjoint domains apply to the same individual. For example, the use of XPath to navigate XML documents and detect instances of objects belonging to certain OWL classes has been presented in [10]. In particular, their approach and implementation, i.e., their JXML2OWL framework, maps XML documents to existing OWL ontologies via explicit mapping rules that use XPath. Out approach, in contrast, uses a generic mapping of JSON to concept descriptions in a DL and then uses the full power of reasoning in the DL to capture such mappings and to achieve a variety other goals, including detection of entities and entity-based equality between subdocuments. The closest to our work is a virtual OBDA architecture, where data is stored in a JSON MongoDB repository [11]. The architecture extends the basic OBDA architecture by introducing a relational view (an ABox) over MongoDB with respect to a set of type constraints. Rewritten queries by the OBDA framework over the relational view are translated using a fragment of MongoDB aggregate queries. However, as far as we are aware, functional dependencies are not involved in identification issues, nor in separating entities into separate documents. There is a great deal of work that attempts to synthesize schemata (in various formalisms) from the semi-structured data [9]. However, these approaches are orthogonal to the results in our paper. 4.2. Future work and Extensions There are many avenues for future research: 1. Additional CBox entries generated by reified roles: in Example 2, we could also have created additional CBox entries for (reified) phone ownership, e.g., HAS-PHONE ⊓ ∃ dom.phone−1 .(PERSON ⊓ ∃ fname.{"John"} ⊓ ∃ lname.{"Smith"}) ⊓ ∃ ran.(PHONE ⊓ ∃ dnum{"212 555-1234"}). This, however, requires generalizing the PFD concept constructor and path descriptions to allow for a limited use of inverse features. 2. Diagnosis via consistency and pinpointing/data cleaning: inconsistency of the knowledge base consisting of the domain knowledge 𝒯 and the CBox ToConcept(𝑑𝑜𝑐) indicates that either our domain knowledge does not accurately capture the properties of the documents or that the documents themselves contain erroneous data. Axiom pinpointing [12] and data cleaning [13] can help in this situation. 3. Set-valued properties and referring expressions: another extension relates to extending the RTAs to allow identification to be based on set-valued properties/values. Such an extension, however, requires extensions to the equality-generating constraints in the underlying DL and the ToRE operation. References [1] A. Borgida, D. Toman, G. Weddell, On referring expressions in query answering over first order knowledge bases, in: Proc. KR, 2016, pp. 319–328. [2] D. Toman, G. E. Weddell, Identity resolution in ontology based data access to structured data sources, in: A. C. Nayak, A. Sharma (Eds.), PRICAI 2019: Trends in Artificial Intelligence - 16th Pacific Rim International Conference on Artificial Intelligence, Part I, volume 11670 of Lecture Notes in Computer Science, Springer, 2019, pp. 473–485. [3] S. McIntyre, D. Toman, G. E. Weddell, FunDL - A family of feature-based description logics, with applications in querying structured data sources, in: Description Logic, Theory Combination, and All That - Essays Dedicated to Franz Baader on the Occasion of His 60th Birthday, 2019, pp. 404–430. [4] S. McIntyre, A. Borgida, D. Toman, G. E. Weddell, On limited conjunctions and partial features in parameter-tractable feature logics, in: The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, 2019, pp. 2995–3002. [5] J. Pound, D. Toman, G. E. Weddell, J. Wu, Query algebra and query optimization for concept assertion retrieval, in: V. Haarslev, D. Toman, G. E. Weddell (Eds.), Proceedings of the 23rd International Workshop on Description Logics (DL 2010), Waterloo, Ontario, Canada, May 4-7, 2010, volume 573 of CEUR Workshop Proceedings, CEUR-WS.org, 2010. [6] J. Pound, D. Toman, G. E. Weddell, J. Wu, An assertion retrieval algebra for object queries over knowledge bases, in: T. Walsh (Ed.), IJCAI 2011, Proceedings of the 22nd International Joint Conference on Artificial Intelligence, Barcelona, Catalonia, Spain, July 16-22, 2011, IJCAI/AAAI, 2011, pp. 1051–1056. [7] A. Borgida, D. Toman, G. E. Weddell, On referring expressions in information systems derived from conceptual modelling, in: I. Comyn-Wattiau, K. Tanaka, I. Song, S. Yamamoto, M. Saeki (Eds.), Conceptual Modeling - 35th International Conference, ER 2016, Gifu, Japan, November 14-17, 2016, Proceedings, volume 9974 of Lecture Notes in Computer Science, 2016, pp. 183–197. [8] F. Baader, D. Calvanese, D. L. McGuinness, D. Nardi, P. F. Patel-Schneider, The Description Logic Handbook: Theory, Implementation, and Applications, Cambridge University Press, 2003. [9] M. Hacherouf, S. N. Bahloul, C. Cruz, Transforming xml documents to owl ontologies: A survey, Journal of Information Science 41 (2015) 242–259. [10] T. Rodrigues, P. Rosa, J. Cardoso, Mapping xml to exiting owl ontologies, in: International Conference WWW/Internet, Citeseer, 2006, pp. 72–77. [11] E. Botoeva, D. Calvanese, B. Cogrel, M. Rezk, G. Xiao, OBDA beyond relational dbs: A study for mongodb, in: M. Lenzerini, R. Peñaloza (Eds.), Proceedings of the 29th International Workshop on Description Logics, Cape Town, South Africa, April 22-25, 2016, volume 1577 of CEUR Workshop Proceedings, CEUR-WS.org, 2016. [12] R. Peñaloza, Axiom pinpointing, in: G. Cota, M. Daquino, G. L. Pozzato (Eds.), Applications and Practices in Ontology Design, Extraction, and Reasoning, volume 49 of Studies on the Semantic Web, IOS Press, 2020, pp. 162–177. [13] I. F. Ilyas, X. Chu, Data Cleaning, ACM, 2019. URL: https://doi.org/10.1145/3310205. doi:10. 1145/3310205.