Accessing Document Data Sources using Referring
Expression Types
Alexander Borgida1 , Enrico Franconi2 , David Toman3 and Grant Weddell3
1
  Department of Computer Science, Rutgers University, New Brunswick, US
2
  KRDB Research Centre for Knowledge and Data, Free University of Bozen-Bolzano, Italy
3
  Cheriton School of Computer Science, University of Waterloo, Waterloo, Canada


                                         Abstract
                                         We show how JSON documents can be abstracted as concept descriptions in an appropriate description
                                         logic. This representation allows the use of additional background knowledge in the form of a TBox and
                                         an assignment of referring expression types (RETs) to certain primitive concepts to detect situations in
                                         which subdocuments, perhaps multiple subdocuments located in various parts of the original documents,
                                         capture information about a particular conceptual entity. Detecting such situations allows for normalizing
                                         the JSON document into several separate documents that capture all information about such conceptual
                                         entities in separate documents. This transformation preserves all the original information present in the
                                         input documents. The RET assignment contributes a set of possible concept descriptions that enable more
                                         refined and normalized capture of documents, and to more crafted answers to queries that adhere to user
                                         expectations expressed as RETs. We also show how RETs allow checking for a document admissibility
                                         condition ensuring that each document describes a single conceptual entity.


1. Introduction and Motivation
Suppose we have a JSON/mongoDB document, and an ontology attaching semantics to (most)
fields in JSON objects (called “keys” in the JSON definition; for the rest of this paper we reserve
the word “key” for database-like keys). More concretely we treat the fields in JSON objects as
(functional) roles in an underlying DL and use a TBox to add additional appropriate concept
identifiers for the domain of the document. For example, for JSON document

                                       { "fname": "John", "lname": "Smith", "age": 25,
                                         "wife": { "fname" : "Mary", "lname": "Smith" } }

the TBox might contain subsumptions stating, e.g., that:

                  • PERSONs are objects that posses fname and lname fields,
                  • fname and lname form a key for PERSONs,
                  • a wife of a PERSON is also a PERSON, and so on.

A formal way of capturing these constraints in our FunDL dialect is given in Example 3 below.

  DL 2022: 35th International Workshop on Description Logics, August 7–10, 2022, Haifa, Israel
$ borgida@cs.rutgers.edu (A. Borgida); franconi@inf.unibz.it (E. Franconi); david@uwaterloo.ca (D. Toman);
gweddell@uwaterloo.ca (G. Weddell)
                                       © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
    CEUR
    Workshop
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073
                                       CEUR Workshop Proceedings (CEUR-WS.org)
   We are interested in asking conjunctive queries over the document, with answers being
returned in some “answer language” ℒ𝑎𝑛𝑠𝑤𝑒𝑟 . We could simply create distinct identifiers for
each node in the JSON tree, and use roles (obtained from the fields of JSON objects) to connect
them, resulting in an ABox, which in the above case might contain an assertion (id434,25)
: age. In this case, a simple query like q(P) :- age(P,25) would return id434, which is
quite meaningless without laboriously tracing back the creation of the ABox; and this answer
makes no connection with the knowledge in the ontology. A more desirable answer would use
the TBox, and return something like PERSON with fname=”John” and lname=”Smith”. This issue,
and many others concerning the appropriate choice of description for objects in query answers
was first addressed in [1], where the notion of “singular referring expression” was introduced.
It stood for a concept description that was guaranteed to denote a single individual.
   Given the use of a referring expression in ℒ𝑎𝑛𝑠𝑤𝑒𝑟 to, among others, avoid meaningless object
ids, another paper [2] suggested doing a similar thing in the ℒ𝑡𝑒𝑙𝑙 : replacing the ABox with a
“CBox”, where singular referring expressions are used to state assertional facts as concepts the
KB knows about. For example, the JSON fragment above could be captured by the concept
                  ∃ fname.{"John"} ⊓ ∃ lname.{"Smith"} ⊓ ∃ age.{"25"} ⊓
                  ∃ wife.(∃ fname.{"Mary"} ⊓ ∃ lname.{"Smith"})
Here, nominal concepts play a central role, as do key constraints for singularity, and a general
rule that CBox concepts cannot have empty interpretations.
   An algorithm was given in [2] for computing certain answers to conjunctive queries, which
returns singular referring expressions appearing in the CBox. This work was carried out in
a dialect of the FunDL family of description logics [3]. FunDL and its dialects replace roles
with (possibly partial) unary functions called features, reify general roles, and always include a
so-called path functional dependency (PFD) concept constructor to express constraints that are
equality generating (including keys). In particular, the dialect used for these purposes in [2] has
the property that DL reasoning and query answering are both polynomial-time (in the size of
the underlying knowledge base). The above paper also gave, as an example of CBox use, one
way to represent a JSON document (in particular, for a document in a MongoDB collection), as
a singular concept description in FunDL.
Example 1 (from [2]). Consider the case where PERSON is the name of a MongoDB JSON collection
with a value as given in Figure 1. The PERSON document is captured by a CBox containing the
following concept description expressed in terms of FunDL.
 ∃ collection.{"person"} ⊓
 ∃ data.∃ dom− .∃ ran(
            ∃ fname.{"John"} ⊓ ∃ lname.{"Smith"} ⊓
            ∃ age.{"25"} ⊓ ∃ wife.∃ fname.{"Mary"} ⊓
            ∃ phone.∃ dom− .∃ ran(∃ colour.{"red"} ⊓ ∃ dnum.{"212 555-1234"})) ⊓
        ∃ dom− .∃ ran(
            ∃ fname.{"Mary"} ⊓ ∃ lname.{"Jones"} ⊓
            ∃ salary.{"$150000CAD"} ⊓ ∃ spouse.∃ fname.{"John"} ⊓
            ∃ phone.∃ dom− .∃ ran(∃ loc.{"home"} ⊓ ∃ dnum.{"212 555-1234"}) ⊓
                     ∃ dom− .∃ ran(∃ loc.{"work"} ⊓ ∃ dnum.{"212 666-4567"}))    □
{ "collection": "person",
 "data" : [
  { "fname": "John", "lname": "Smith", "age": 25,
    "wife": { "fname" : "Mary" },
    "phone": [
       {"colour": "red", "dnum": "212 555-1234"}
     ] } ,
  { "fname": "Mary", "lname": "Jones", "salary": "$150,000 (CAD)",
    "spouse": { "fname": "John" },
    "phone": [
       {"loc": "home", "dnum": "212 555-1234"},
       {"loc": "work", "dnum": "212 666-4567"}
     ] }
 ] }


Figure 1: JSON PERSON Document.


   Intuitively, JSON values are mapped to concepts as follows: (a) primitive values to nominals,
(b) (compound) objects to conjunctions of existential restrictions of features corresponding to
the field names of the (mapping of) values, and (3) arrays (treated as sets) to multi-valued roles
reified via the features dom and ran (see Definition 4).
   One can infer that a document would have the above-mentioned singularity property by
asserting in a FunDL TBox that collection names (person in our case) are unique.
   Since in our case query answers are elements of the CBox, the above translation of a full
JSON document into a single CBox entry is undesirable. For this reason, we propose to break it
up into smaller conceptual entities that make sense based on the terminology in the TBox. This
is achieved by using referring expression types (RETs) introduced in [1]. In particular, we will
show how a combination of a TBox and an RET assignment to the primitive concepts occurring
in the TBox enable mapping an initial CBox directly obtained from a MongoDB database, as
illustrated above, to an alternative normalized CBox, as now illustrated.
Example 2 (JSON normalization). Applying our normalization procedure to the CBox in Example 1
will then obtain the following concepts:

 DOCUMENT ⊓ ∃ collection.{"person"} ⊓
       ∃ data(∃ dom− .∃ ran(PERSON ⊓ ∃ fname.{"John"} ⊓ ∃ lname.{"Smith"}) ⊓
              ∃ dom− .∃ ran(PERSON ⊓ ∃ fname.{"Mary"} ⊓ ∃ lname.{"Jones"}))

 PERSON ⊓ ∃ fname.{"John"} ⊓ ∃ lname.{"Smith"} ⊓
               ∃ age.{"25"} ⊓ ∃ wife.∃ fname{"Mary"} ⊓
               ∃ phone.∃ dom− .∃ ran(PHONE ⊓ ∃ dnum{"212 555-1234"})
 PERSON ⊓ ∃ fname.{"Mary"} ⊓ ∃ lname.{"Jones"} ⊓
               ∃ salary.{"$150000CAD"} ⊓ ∃ spouse.∃ fname{"John"} ⊓
               ∃ phone.∃ dom− .∃ ran(PHONE ⊓ ∃ dnum{"212 555-1234"}) ⊓
                       ∃ dom− .∃ ran(PHONE ⊓ ∃ dnum{"212 666-4567"}))
          PHONE ⊓ ∃ dnum{"212 555-1234"} ⊓ ∃ loc.{"home"} ⊓ ∃ colour.{"red"}
          PHONE ⊓ ∃ dnum{"212 555-4567"} ⊓ ∃ loc.{"work"}
In the above, the underlined parts of these concepts serve as referring expressions identifying entities,
while the remainder of the concept tells us facts about the entities, using (the dash-underlined)
referring expressions when needed, to record such facts. For example, observe how references to
phone entities in phone facts about persons require only dnum facts about phones.                     □

  Our contributions are as follows.
   1. We show how a JSON document (or a MongoDB collection) can be abstracted as a concept
      description, as illustrated in Example 1, and contrast this way of attaching a meaning or
      semantics with other proposals.
   2. We show how an TBox can attach meaning to such concept descriptions (and therefore
      to the JSON documents).
   3. We describe the normalization procedure which uses the given TBox and a referring
      expression type assignment in order to extract additional intuitively reasonable CBox
      subconcepts, as illustrated in Example 2.
   4. We also show how information about the same entity can be consolidated even if it was
      originally recorded in different parts of the JSON document.
   5. Finally, we present a more effective way of diagnosing an admissibility property of a
      CBox that ensures interpretations of referring expressions are indeed singular.
All of the above tasks are accomplished by relying solely on reasoning about equalities and
concept memberships of objects corresponding to values in the input JSON document with
respect to the DL TBox. This sets our approach apart from many other approaches that
commonly rely on mapping and transformation rules.
   The paper is organized as follows: Section 2 provides the needed background relating to
FunDL and to referring expressions. Section 3 then outlines the main results of this paper
relating to the ability to identify subdocuments relating to identifiable entities and subsequent
separation of these entities in separate CBox entries/documents. We conclude with a brief
overview of related work and with suggestions for follow-on research.


2. Definitions and Background
We now formally define the artifacts introduced in our introductory comments, beginning with
a general definition of concept descriptions for members of the FunDL family of DLs with
PTIME complexity of logical consequence.1 Recall that members of this family replace roles
with partial functions, and that concept descriptions not only occur in a TBox but also serve as
referring expressions in a CBox.

Definition 1 (FunDL Concepts, Referring Expressions and Knowledge Bases). Let F and PC be
sets of feature names and primitive concept names, respectively. A path expression is defined by

    1
        Some additional conditions must hold on PFDs and on conjunctions; see [3, 4] for details.
              Syntax                             Semantics: Defn of “(·)ℐ ”

𝐶 ::= ⊥                              ∅                                                                      (bottom)
                                                           ⋀︀𝑘
    | 𝐶 : Pf 1 , ..., Pf 𝑘 → Pf 0    {𝑥 | ∀𝑦.((𝑦 ∈ 𝐶 ℐ ∧ ( 𝑖=0 {𝑥, 𝑦} ⊆ (∃Pf 𝑖 .⊤)ℐ )                          (PFD)
                                           ⋀︀𝑘
                                          ( 𝑖=1 Pf ℐ𝑖 (𝑥) = Pf ℐ𝑖 (𝑦))) → (Pf ℐ0 (𝑥) = Pf ℐ0 (𝑦)))}

    |   ⊤                            △ℐ                                                                       (top)
    |   A                            Aℐ ⊆ △ℐ                                         (primitive concept; A ∈ PC)
    |   ∃Pf.𝐶                        {𝑥 | ∃𝑦.(𝑦 ∈ 𝐶 ℐ ∧ Pf ℐ (𝑥) = 𝑦)}                          (value restriction)
    |   𝐶1 ⊓ 𝐶2                      𝐶1ℐ ∩ 𝐶2ℐ                                                       (conjunction)

    | {𝑎}                            {𝑎ℐ }                                                               (nominal)
    | ∃𝑓 −1 .𝐶                       {𝑓 ℐ (𝑥) | 𝑥 ∈ 𝐶 ℐ }                               (qualified inverse feature)

Figure 2: Syntax and semantics of concept descriptions.


the grammar “Pf ::= 𝑓. Pf | id ” for 𝑓 ∈ F. A concept description is defined by the grammar on
the left-hand-side of Fig. 2.2
   An inclusion dependency is an expression of the form 𝐶1 ⊑ 𝐶2 , where 𝐶𝑖 is parsed by the first
six productions in Fig. 2. A terminology (TBox) 𝒯 consists of a finite set of inclusion dependencies.
A referring expression is a concept description 𝐶 parsed by the last six productions in Fig. 2. A
concept box (CBox) 𝒞 consists of a finite set of referring expressions. A knowledge base 𝒦 is a
TBox/CBox pair (𝒯 , 𝒞).
   The semantics of concept descriptions and path expressions is defined with respect to a structure
ℐ = (△ℐ , ·ℐ ), where △ℐ is a domain of “objects” and ·ℐ an interpretation function that fixes
the interpretations of primitive concepts 𝐴 to be subsets of △ℐ and primitive features 𝑓 to be
partial functions 𝑓 ℐ : △ℐ → △ℐ . The interpretation is extended to path expressions, id ℐ = 𝜆𝑥.𝑥,
(𝑓. Pf)ℐ = Pf ℐ ∘𝑓 ℐ , in the natural way, and derived concept descriptions 𝐶 as defined in the centre
column of Fig. 2.
   An interpretation ℐ satisfies an inclusion dependency 𝐶1 ⊑ 𝐶2 if 𝐶1ℐ ⊆ 𝐶2ℐ , and is a model
of a TBox 𝒯 if it satisfies all inclusion dependencies in 𝒯 . ℐ is a model of a knowledge base 𝒦
= (𝒯 , 𝒞), written ℐ |= 𝒦, if it satisfies 𝒯 and also that |𝐶 ℐ | > 0 holds for every 𝐶 ∈ 𝒞.
   Given a TBox 𝒯 , a referring expression 𝐶 is singular with respect to 𝒯 if |𝐶 ℐ | ≤ 1 for all
interpretations ℐ that are models of 𝒯 .
   The logical implication problem asks if 𝒦 |= 𝐶1 ⊑ 𝐶2 holds, that is, if 𝐶1 ⊑ 𝐶2 is satisfied in
all models of 𝒦.                                                                                    □

Definition 2 (Admissibility and Query Answers). Let 𝒦 = (𝒯 , 𝒞) be a FunDL knowledge base
and 𝑄 = {(𝑥1 , . . . , 𝑥𝑘 ) | 𝜙} a conjunctive query. The CBox 𝒞 is admissible for 𝒯 if each 𝐶 ∈ 𝒞 is
a referring expression that is singular with respect to 𝒯 . (Thus, if 𝒦 is consistent and 𝒞 is admissible
for 𝒯 , then |𝐶 ℐ | = 1 for any 𝐶 ∈ 𝒞 and any model ℐ for which ℐ |= 𝒦.)


    2
     A variety of equality generating dependencies, including keys, can be expressed with the use of a path functional
dependency (PFD) concept description generated by the second production of this grammar.
  A k-tuple of referring expressions (𝐶1 , . . . , 𝐶𝑘 ) is a certain answer to 𝑄 in (𝒯 , 𝒞) if

                              𝒦 |= ∃𝑥1 , . . . , 𝑥𝑘 .(𝜙 ∧ 𝐶1 (𝑥1 ) ∧ . . . ∧ 𝐶𝑘 (𝑥𝑘 ))

for {𝐶1 , . . . , 𝐶𝑘 } ⊆ 𝒞.                                                                                   □

   The second artifact introduced in our introduction relates to so-called referring expression
types (RETs) introduced in [1]. In this earlier work, such types were attached to the free variables
of a conjunctive query, and denoted a space of well-formed formulae 𝜓 with one free variable,
over a given FO signature consisting of unary and binary predicates, that were eligible referring
expressions for the variable. Such types are essentially patterns of possible 𝜓, in our case,
patterns of possible concept descriptions 𝐶, and are now attached to primitive concepts by a
user defined referring expression type assignment (RTA). They determine a set of possible concept
descriptions that are eligible referring expressions for subdocuments. We illustrate this below
for our running example, but leave the presentation on how this is accomplished to Section 3.

Definition 3 (Referring Expression Types and Assignments). A referring expression type is
defined by the following grammar:3

                               𝑅𝑒 ::= A | ∃Pf.𝑅𝑒 | 𝑅𝑒 ⊓ 𝑅𝑒 | {?} | 𝑅𝑒 ; 𝑅𝑒

A referring expression type assignment (RTA) over a TBox 𝒯 is partial function mapping primitive
concepts A occurring in 𝒯 to a referring expression type RTA(A). We define a language of referring
expressions inhabiting 𝑅𝑒, ℒ(𝑅𝑒), as follows:

                           ℒ(A) = {A}
                     ℒ(∃Pf.𝑅𝑒) = {∃Pf.𝐶 | 𝐶 ∈ ℒ(𝑅𝑒)}
                   ℒ(𝑅𝑒1 ⊓ 𝑅𝑒2 ) = {𝐶1 ⊓ 𝐶2 | 𝐶1 ∈ ℒ(𝑅𝑒1 ) and 𝐶2 ∈ ℒ(𝑅𝑒2 )}
                         ℒ({?}) = {{𝑏} | 𝑏 is a constant symbol}
                    ℒ(𝑅𝑒1 ; 𝑅𝑒2 ) = ℒ(𝑅𝑒1 ) ∪ ℒ(𝑅𝑒2 )                                                         □

Example 3 (CBox normalization). Let CBox 𝒞 consist of the single referring expression in Exam-
ple 1 obtained from the MongoDB collection given earlier, and let TBox 𝒯 consist of the following
inclusion dependencies:

          (∃ collection.⊤) ⊓ (∃ data.⊤) ⊑ DOCUMENT
               (∃ fname.⊤) ⊓ (∃ lname.⊤) ⊑ PERSON
                                ∃ dnum.⊤ ⊑ PHONE
                                  DOCUMENT             ⊑    DOCUMENT : collection → id
                                    PERSON             ⊑    PERSON : fname, lname → id
                                     PHONE             ⊑    PHONE : dnum → id
                                    PERSON             ⊑    ∃ wife.PERSON


    3
    This is a pattern language obtained by abstracting nominals in referring expressions and by admitting a final
production to express preference among referring expressions [1].
Our normalization procedure presented in the next section, when given the knowledge base 𝒦 =
(𝒯 , 𝒞) together with the following RTA assignment, will then replace 𝒞 in 𝒦 by the CBox in
Example 2 of our introduction:

             RTA(DOCUMENT) = DOCUMENT ⊓ ∃ collection.{?}
                RTA(PERSON) = PERSON ⊓ ∃ lname.{?} ⊓ ∃ fname.{?}
                 RTA(PHONE) = PHONE ⊓ ∃ dnum.{?}                                                    □
   Mapping a JSON value to a referring expression in a CBox is straightforward with FunDL
concepts. The following definition of ToConcept provides the details. (Recall that Example 1
in our introduction illustrates an invocation of ToConcept on a JSON collection.) Some obser-
vations and reminders: (1) this mapping assumes any JSON value, including an array, will map
to some element of an underlying domain; (2) the mapping relies entirely on interpreting field
names in field-value pairs comprising JSON objects as feature names; and (3) arrays (treated as
sets) are mapped to multi-valued roles reified via the features dom and ran.
Definition 4 (ToConcept). An arbitrary JSON value is mapped to a CBox referring expression as
follows:

                      ToConcept("s") ↦→ {"s"}
                     ToConcept(null) ↦→ ⊤
 ToConcept({"k1 " : 𝑣1 , . . . , "kn " : 𝑣𝑛 }) ↦→ ∃ k1 . ToConcept(𝑣1 ) . . . ⊓ ∃ kn . ToConcept(𝑣𝑛 )
             ToConcept([𝑣1 , . . . , 𝑣𝑚 ]) ↦→ ∃ dom− .∃ ran. ToConcept(𝑣1 ) ⊓
                                                             . . . ⊓ ∃ dom− .∃ ran. ToConcept(𝑣𝑚 )

where the first case covers all JSON values that are strings, numerics, and Booleans.               □


3. CBox Normalization
Our CBox normalization procedure derives from a pair of normalization rules presented in
Section 3.2. To enable references to sub-concepts in a concept description required by our
formulation of these rules, we define the mapping ToAbox.
Definition 5 (ToABox). An arbitrary referring expression 𝐶 is mapped to an ABox and a mapping
from constant symbols 𝑎𝑖 in this ABox to nodes in the syntactic tree of 𝐶 as follows:

                 ToAbox(𝑎 : {𝑏}) ↦→ {𝑎 = 𝑏}
                    ToAbox(𝑎 : A) ↦→ {A(𝑎)}, A primitive
                ToAbox(𝑎 : ∃𝑓.𝐶) ↦→ {𝑓 (𝑎) = 𝑏} ∪ ToAbox(𝑏 : 𝐶), 𝑏 fresh
              ToAbox(𝑎 : ∃𝑓 −1 .𝐶) ↦→ ⋃︀ {𝑓 (𝑏) = 𝑎} ∪ ToAbox(𝑏 : 𝐶), 𝑏 fresh
                                           𝑛
         ToAbox(𝑎 : 𝐶1 ⊓ . . . ⊓ 𝐶𝑛 ) ↦→   𝑖=1 {𝑎 = 𝑏𝑖 } ∪ ToAbox(𝑏𝑖 : 𝐶𝑖 ), 𝑏𝑖 fresh

                                                                                                    □
                       ⋃︀
We define ToAbox(𝒞) = 𝐶∈𝒞 ToAbox(𝑎 : 𝐶), 𝑎 fresh for each 𝐶.
   The ToAbox(𝐶) function converts an input concept 𝐶 to an ABox by traversing the syntactic
structure of the concept, assigning distinct constant symbols 𝑎 to subconcepts of 𝐶, and then
replacing these subconcepts by atomic ABox assertions.
Definition 6 (Context). Let 𝐶 be a concept description. We use the notation 𝐶[𝑎′ : 𝐶 ′ ] to denote a
subconcept 𝐶 ′ of the concept 𝐶 where 𝑎′ is the constant symbol assigned to 𝐶 ′ by ToAbox(𝑎 : 𝐶).
For the top-level concept we simply use the context [𝑎 : 𝐶].
   A projection function ToRE that uses a referring expression type 𝑅𝑒 to generate a referring
expression 𝐶 in ℒ(𝑅𝑒) and holding for a particular individual 𝑎 in a knowledge base (or
producing an undefined value if no such 𝐶 exists) is now defined. (Observe the use of an
auxiliary recursive function with the same name in the definition that takes a path function Pf
as a third argument.)
Definition 7 (Projection on 𝑅𝑒). Let 𝒦 be a consistent knowledge base and 𝑎 a constant. We define
a projection function ToRE(𝑎, 𝑅𝑒) for a referring expression 𝑅𝑒 as the result of the following
recursive definition of ToRE(𝑎, 𝑅𝑒, id ) on the structure of 𝑅𝑒:
             ToRE(𝑎, A, Pf) = A if 𝒦 |= 𝑎 : ∃Pf.A, undefined otherwise
           ToRE(𝑎, {?}, Pf) = {𝑏} if 𝒦 |= 𝑎 : ∃Pf.{𝑏} for some 𝑏, undefined otherwise
       ToRE(𝑎, ∃Pf ′ .𝑅𝑒, Pf) = ∃Pf ′ .ToRE(𝑎, 𝑅𝑒, Pf . Pf ′ )
     ToRE(𝑎, 𝑅𝑒1 ⊓ 𝑅𝑒2 , Pf) = ToRE(𝑎, 𝑅𝑒1 , Pf) ⊓ ToRE(𝑎, 𝑅𝑒2 , Pf) if both defined
      ToRE(𝑎, 𝑅𝑒1 ; 𝑅𝑒2 , Pf) = ToRE(𝑎, 𝑅𝑒1 , Pf) if defined, ToRE(𝑎, 𝑅𝑒2 , Pf) otherwise         □
The following is a simple consequence of this and our previous definitions. Also, see earlier
work in [5, 6] for effective ways of computing the second and fourth base cases.
Lemma 1. For any constant 𝑎, any 𝑅𝑒, and any consistent 𝒦, ToRE(𝑎, 𝑅𝑒) ∈ ℒ(𝑅𝑒).
  It will also be useful to have a simplification procedure for referring concepts. The following
definition of such a procedure will suffice for illustrative purposes, and clearly preserves concept
equivalence.
Definition 8 (Concept Simplification). We write Simplify(𝐶) to denote an exhaustive application
of the following to referring expression 𝐶:
   1. If an 𝑛-way conjunction contains ∃𝑓.𝐶1 and ∃𝑓.𝐶2 , replace both conjuncts by ∃𝑓.𝐶1 ⊓ 𝐶2 .
   2. If an 𝑛-way conjunction contains duplicate conjuncts, remove one of the conjuncts.    □

3.1. CBox Admissibility
In this subsection, we show how, given a TBox 𝒯 , one can statically test for admissibility of
any CBox obtained by the normalization rules given in Subsection 3.2 that follows. This is
achieved by a mapping to a sequence of logical consequence problems for inclusion dependencies
expressing functional dependencies with PFDs that are induced by a given RTA assignment.
We begin by defining a normalization of an 𝑅𝑒 that preserves ℒ(𝑅𝑒).
Definition 9 (Normalized Types; from [1]). We write Norm(𝑅𝑒) to refer to an exhaustive
application of the following rewrite rules to 𝑅𝑒:
                          𝑅𝑒 ⊓ (𝑅𝑒1 ; 𝑅𝑒2 ) ↦→ 𝑅𝑒 ⊓ 𝑅𝑒1 ; 𝑅𝑒 ⊓ 𝑅𝑒2
                          (𝑅𝑒1 ; 𝑅𝑒2 ) ⊓ 𝑅𝑒 ↦→ 𝑅𝑒1 ⊓ 𝑅𝑒1 ; 𝑅𝑒2 ⊓ 𝑅𝑒
                           ∃Pf.(𝑅𝑒1 ; 𝑅𝑒2 ) ↦→ ∃Pf.𝑅𝑒1 ; ∃Pf.𝑅𝑒2                                  □
   The following are simple consequences: (1) ℒ(𝑅𝑒) = ℒ(Norm(𝑅𝑒)), and (2) all preference
operators (“;”) are at the top level. We call the maximal “;”-free parts of Norm(𝑅𝑒𝑡) preference-free
components.
   To statically test for singularity of referring expressions generated by the ToRE function for
a particular referring expression type, we use the following auxiliary definitions:

        Pfs({?}) = {id }                              Con({?}) = ⊤
          Pfs(A) = { }                                 Con(A) = A
    Pfs(∃Pf ′ .𝑅𝑒) = {Pf ′ . Pf | Pf ∈ Pfs(𝑅𝑒)}   Con(∃Pf ′ .𝑅𝑒) = ∃Pf ′ .Con(𝑅𝑒)
  Pfs(𝑅𝑒1 ⊓ 𝑅𝑒1 ) = Pfs(𝑅𝑒1 ) ∪ Pfs(𝑅𝑒2 )       Con(𝑅𝑒1 ⊓ 𝑅𝑒1 ) = Con(𝑅𝑒1 ) ⊓ Con(𝑅𝑒2 )

These functions extract a set of paths leading to nominals and a concept from a preference-free
referring expression type. Altogether, we are now able to formulate the singularity test following
the ideas presented in [1], Theorem 20:

Theorem 1. Let 𝒯 be a TBox and 𝑅𝑒 a referring expression type. Then all referring expressions
in ℒ(𝑅𝑒) are singular if 𝒯 |= Con(𝑅𝑒′ ) ⊑ Con(𝑅𝑒′ ) : Pfs(𝑅𝑒′ ) → id for every preference-free
component 𝑅𝑒′ of Norm(𝑅𝑒).

  Our static test of admissibility of any CBox generated by our normalization rules then follows
by applying the above to any 𝑅𝑒 in the range of a programmer supplied RTA.

3.2. CBox Normalization Rules
We now have the necessary machinery to present our two rules for normalizing the CBox of a
given knowledge base 𝒦 = (𝒯 , 𝒞) and referring expression type assignment RTA. Our first
main rule extracts sub-concepts of a given CBox concept 𝐶 as an additional separate CBox
concept.

Definition 10 (Subdocument Extraction). Assume 𝐶 is a concept in 𝒞 which contains a subcon-
cept 𝐶 ′ (i.e., 𝐶[𝑎 : 𝐶 ′ ]) corresponding to a JSON object. Also assume two properties hold: that
(𝒯 , ToAbox(𝒞)) |= 𝐴(𝑎), and that RTA(𝐴) is defined for primitive concept 𝐴. We form a new
CBox 𝒞 ′ as follows

         𝒞 ′ := 𝒞 − {𝐶[𝑎 : 𝐶 ′ ]} ∪ {𝐶[𝑎 : ToRE(𝑎, RTA(𝐴))], ToRE(𝑎, RTA(𝐴)) ⊓ 𝐶 ′ }

if ToRE(𝑎, RTA(𝐴)) is defined. If an entity 𝐴 cannot be properly identified, we report a warning.
                                                                                               □

   Intuitively, we replace a single monolithic concept 𝐶 in 𝒞 in which a subconcept 𝐶 ′ was
identified as a representation of an 𝐴 entity by a modified variant of 𝐶 in which 𝐶 ′ has been
replaced by its referring expression. In addition we create a new CBox concept ToRE(𝑎, 𝑅𝑒(𝐴)) ⊓
𝐶 ′ for this entity.
   Note that the choice of 𝐴 above is non-deterministic, but does not affect the soundness of the
extraction. However, this non-determinism can result in different referring expressions used to
identify the same subdocument. Although beyond the scope this paper, in [7], we described a
technique that allows one to diagnose that a given RTA has anticipated any such ambiguity,
that any choice of 𝐴 must still lead to the same syntactic referring expression.4 We refer to
such an RTA as identity resolving.
  Also note that, due to the properties of referring expressions (singularity in particular) we
have (𝒯 , ToAbox(𝒞 ′ )) |= 𝑎 = 𝑏 for the above-introduced concepts 𝐶[𝑎 : ToRE(𝑎, RTA(𝐴))]
and [𝑏 : ToRE(𝑎, RTA(𝐴)) ⊓ 𝐶 ′ ] in 𝒞 ′ . Hence, the models of (𝒯 , 𝒞) and (𝒯 , 𝒞 ′ ) will coincide.
  The second of our two rules replaces two CBox referring expressions with a single referring
expression when co-reference is implied. Note that, were the given RTA identity resolving, we
would always have 𝐷 = 𝐷′ .

Definition 11 (Equivalent Subdocument Merge). Let [𝑎 : 𝐷 ⊓ 𝐶] and [𝑏 : 𝐷′ ⊓ 𝐶 ′ ] be two
concepts in 𝒞 such that (𝒯 , ToAbox(𝒞)) |= 𝑎 = 𝑏. We replace 𝒞 with

              𝒞 ′ := 𝒞 − {[𝑎 : 𝐷 ⊓ 𝐶], [𝑏 : 𝐷′ ⊓ 𝐶 ′ ]} ∪ {[𝑎 : 𝐷 ⊓ Simplify(𝐶 ⊓ 𝐶 ′ )}.                           □

   Our main results now follow.

Theorem 2. Let 𝒞 be a CBox in a consistent knowledge base 𝒦 = (𝒯 , 𝒞) and 𝒞 ′ a CBox obtained
by applying the Subdocument Extraction or the Equivalent Subdocument Merge rules. Then
every model of 𝒦 is also a model of (𝒯 , 𝒞 ′ ) and vice versa.

   Strictly speaking, this is not true for (𝒯 , ToAbox(𝒞)) and (𝒯 , ToAbox(𝒞 ′ )), but the models of
these two knowledge bases will only differ in what constants are assigned to what subconcepts
in 𝒞 and 𝒞 ′ , respectively.

Theorem 3. Let 𝒞 be an admissible CBox in a consistent knowledge base 𝒦 = (𝒯 , 𝒞) and 𝒞 ′ a
CBox obtained by applying the Subdocument Extraction or the Equivalent Subdocument Merge
rules. Then 𝒞 ′ is admissible.


4. Summary Comments
The main contribution of this paper is showing how JSON-like data sources can be abstracted
as concept descriptions in an appropriate DL in a very generic way. This enables their domain-
specific semantics to be naturally captured as a TBox in the same logic, which then allows one
to draw on mature reasoning services that have been developed for DLs [8, 4]. In addition,
our approach utilizes referring expressions [1] as the means of identifying entities described in
such data sources. This is crucial to the proposal’s ability to detect multiple subdocuments that
provide information about the same entity. Notably, this is achieved completely automatically
since the appropriate equalities will be entailed by the knowledge base consisting of a domain-
specific TBox and the data sources captured as concepts in a CBox.
   Our primary technical contribution is our ability to separate entities into separate documents
and to consolidate documents that provide information about the same entity, also by appeal to
referring expressions and to entailment in the underlying DL.

    4
      Essentially, this entails ensuring that preference in RTA(𝐴), for any primitive concept 𝐴, exhaustively accounts
for any primitive concept 𝐵 for which there exists some interpretation ℐ for which (𝐴 ⊓ 𝐵)ℐ is non-empty.
4.1. Related Work
There are many papers that take as input a semi-structured document, in JSON or XML (some-
times with a schema) plus an ontology in a DL TBox, and create individual instance descriptions
in an ABox; for a survey see [9]. Usually, this is intended for just adding semantics to the
document, but reasoning could be used to detect inconsistencies, such as a situation where two
properties with disjoint domains apply to the same individual. For example, the use of XPath to
navigate XML documents and detect instances of objects belonging to certain OWL classes has
been presented in [10]. In particular, their approach and implementation, i.e., their JXML2OWL
framework, maps XML documents to existing OWL ontologies via explicit mapping rules that
use XPath. Out approach, in contrast, uses a generic mapping of JSON to concept descriptions
in a DL and then uses the full power of reasoning in the DL to capture such mappings and to
achieve a variety other goals, including detection of entities and entity-based equality between
subdocuments.
   The closest to our work is a virtual OBDA architecture, where data is stored in a JSON
MongoDB repository [11]. The architecture extends the basic OBDA architecture by introducing
a relational view (an ABox) over MongoDB with respect to a set of type constraints. Rewritten
queries by the OBDA framework over the relational view are translated using a fragment of
MongoDB aggregate queries. However, as far as we are aware, functional dependencies are not
involved in identification issues, nor in separating entities into separate documents.
   There is a great deal of work that attempts to synthesize schemata (in various formalisms)
from the semi-structured data [9]. However, these approaches are orthogonal to the results in
our paper.

4.2. Future work and Extensions
There are many avenues for future research:
   1. Additional CBox entries generated by reified roles: in Example 2, we could also have
      created additional CBox entries for (reified) phone ownership, e.g.,

   HAS-PHONE ⊓ ∃ dom.phone−1 .(PERSON ⊓ ∃ fname.{"John"} ⊓ ∃ lname.{"Smith"})
                 ⊓ ∃ ran.(PHONE ⊓ ∃ dnum{"212 555-1234"}).

      This, however, requires generalizing the PFD concept constructor and path descriptions
      to allow for a limited use of inverse features.
   2. Diagnosis via consistency and pinpointing/data cleaning: inconsistency of the knowledge
      base consisting of the domain knowledge 𝒯 and the CBox ToConcept(𝑑𝑜𝑐) indicates that
      either our domain knowledge does not accurately capture the properties of the documents
      or that the documents themselves contain erroneous data. Axiom pinpointing [12] and
      data cleaning [13] can help in this situation.
   3. Set-valued properties and referring expressions: another extension relates to extending
      the RTAs to allow identification to be based on set-valued properties/values. Such an
      extension, however, requires extensions to the equality-generating constraints in the
      underlying DL and the ToRE operation.
References
 [1] A. Borgida, D. Toman, G. Weddell, On referring expressions in query answering over first
     order knowledge bases, in: Proc. KR, 2016, pp. 319–328.
 [2] D. Toman, G. E. Weddell, Identity resolution in ontology based data access to structured data
     sources, in: A. C. Nayak, A. Sharma (Eds.), PRICAI 2019: Trends in Artificial Intelligence -
     16th Pacific Rim International Conference on Artificial Intelligence, Part I, volume 11670
     of Lecture Notes in Computer Science, Springer, 2019, pp. 473–485.
 [3] S. McIntyre, D. Toman, G. E. Weddell, FunDL - A family of feature-based description
     logics, with applications in querying structured data sources, in: Description Logic, Theory
     Combination, and All That - Essays Dedicated to Franz Baader on the Occasion of His
     60th Birthday, 2019, pp. 404–430.
 [4] S. McIntyre, A. Borgida, D. Toman, G. E. Weddell, On limited conjunctions and partial
     features in parameter-tractable feature logics, in: The Thirty-Third AAAI Conference on
     Artificial Intelligence, AAAI 2019, 2019, pp. 2995–3002.
 [5] J. Pound, D. Toman, G. E. Weddell, J. Wu, Query algebra and query optimization for
     concept assertion retrieval, in: V. Haarslev, D. Toman, G. E. Weddell (Eds.), Proceedings
     of the 23rd International Workshop on Description Logics (DL 2010), Waterloo, Ontario,
     Canada, May 4-7, 2010, volume 573 of CEUR Workshop Proceedings, CEUR-WS.org, 2010.
 [6] J. Pound, D. Toman, G. E. Weddell, J. Wu, An assertion retrieval algebra for object queries
     over knowledge bases, in: T. Walsh (Ed.), IJCAI 2011, Proceedings of the 22nd International
     Joint Conference on Artificial Intelligence, Barcelona, Catalonia, Spain, July 16-22, 2011,
     IJCAI/AAAI, 2011, pp. 1051–1056.
 [7] A. Borgida, D. Toman, G. E. Weddell, On referring expressions in information systems
     derived from conceptual modelling, in: I. Comyn-Wattiau, K. Tanaka, I. Song, S. Yamamoto,
     M. Saeki (Eds.), Conceptual Modeling - 35th International Conference, ER 2016, Gifu, Japan,
     November 14-17, 2016, Proceedings, volume 9974 of Lecture Notes in Computer Science,
     2016, pp. 183–197.
 [8] F. Baader, D. Calvanese, D. L. McGuinness, D. Nardi, P. F. Patel-Schneider, The Description
     Logic Handbook: Theory, Implementation, and Applications, Cambridge University Press,
     2003.
 [9] M. Hacherouf, S. N. Bahloul, C. Cruz, Transforming xml documents to owl ontologies: A
     survey, Journal of Information Science 41 (2015) 242–259.
[10] T. Rodrigues, P. Rosa, J. Cardoso, Mapping xml to exiting owl ontologies, in: International
     Conference WWW/Internet, Citeseer, 2006, pp. 72–77.
[11] E. Botoeva, D. Calvanese, B. Cogrel, M. Rezk, G. Xiao, OBDA beyond relational dbs: A study
     for mongodb, in: M. Lenzerini, R. Peñaloza (Eds.), Proceedings of the 29th International
     Workshop on Description Logics, Cape Town, South Africa, April 22-25, 2016, volume 1577
     of CEUR Workshop Proceedings, CEUR-WS.org, 2016.
[12] R. Peñaloza, Axiom pinpointing, in: G. Cota, M. Daquino, G. L. Pozzato (Eds.), Applications
     and Practices in Ontology Design, Extraction, and Reasoning, volume 49 of Studies on the
     Semantic Web, IOS Press, 2020, pp. 162–177.
[13] I. F. Ilyas, X. Chu, Data Cleaning, ACM, 2019. URL: https://doi.org/10.1145/3310205. doi:10.
     1145/3310205.