Extracting JSON Schemas with tagged unions
Stefan Klessinger1 , Meike Klettke2 , Uta Störl3 and Stefanie Scherzinger1
1
  University of Passau, Passau, Germany
2
  University of Regensburg, Regensburg, Germany
3
  University of Hagen, Hagen, Germany


                                       Abstract
                                       With data lakes and schema-free NoSQL document stores, extracting a descriptive schema from JSON data collections is an
                                       acute challenge. In this paper, we target the discovery of tagged unions, a JSON Schema design pattern where the value of
                                       one property of an object (the tag) conditionally implies subschemas for sibling properties. We formalize these implications
                                       as conditional functional dependencies and capture them using the JSON Schema operators if-then-else. We further motivate
                                       our heuristics to avoid overfitting. Experiments with our prototype implementation are promising, and show that this form
                                       of tagged unions can successfully be detected in real-world GeoJSON and TopoJSON datasets. In discussing future work, we
                                       outline how our approach can be extended further.


1. Introduction                                                                                   1 { " type ": " G e o m e t r y C ol l e c t i o n " ,
                                                                                                  2   " geometries ": [
                                                                                                  3      { " type ": " Point " ,
JSON is a popular data exchange format. Extracting a                                              4        " coordinates ": [30 ,10] } ,
schema from collections of JSON documents is a real-                                              5      { " type ": " Point " ,
                                                                                                  6        " coordinates ": [40 ,15] } ,
world challenge which is actively being researched [1, 2,                                         7      { " type ": " LineString " ,
3, 4, 5, 6, 7, 8].                                                                                8        " coordinates ": [[55 ,5] , [10 ,30] , [10 ,10]] } ,
                                                                                                  9      { " type ": " LineString " ,
   Ideally, the extracted schema describes the data tightly,                                     10        " coordinates ": [[30 ,10] , [10 ,30] , [40 ,40]] }
yet without overfitting. In this article, we target the de-                                      11   ] }

tection of a specific schema design pattern in the JSON
Schema language, the pattern of tagged unions, also known Figure 1: GeoJSON data. The value of property type (here
as discriminated unions, labeled unions, or variant types. Point and LineString) determines the subschema of prop-
This is a recommended design pattern [9], and has been erty coordinates as either encoding a geometric point (an
                                                             array of numbers), or a line (an array of points).
found to be quite common in real-world schemas [10].
Example 1. Figure 1 shows GeoJSON data. The array
starting in line 2 holds four objects, of type Point and
                                                                                                 type with the value Point, then the value of property
LineString. Point coordinates are encoded as an array
                                                                                                 coordinates must be an array of numbers. Else, if the
of numbers, while lines are encoded as an array of points,
and hence, an array of number arrays.                                                            property type has the value LineString, the value of
                                                                                                 coordinates must be an array of points, and hence, an
   The GeoJSON specification (IETF RFC 7946, clickable
link embedded in the PDF) describes six types of ge-                                             array of number arrays.
ometries, including polygons and multi-polygons. Con-                                          Example 2. Various schemas listed on SchemaStore.org,
sistently, the property type serves as a tag to distin-                                        a community-curated collection of JSON Schema dec-
guish the subschema of sibling property coordinates,                                           larations, encode tagged unions using if-then-else.
thereby instantiating a tagged union.                                                          Among them are the Minecraft schemas which use tagged
   GeoJSON comes without an official JSON Schema spec-                                         unions to encode so-called data packs for configuring
ification [11]. In Figure 2, we therefore show a hand-                                         Minecraft worlds. Other examples are Github Issue Forms,
crafted excerpt using the if-then-else construct to                                            or the cloudify schema (clickable links in the PDF).
enforce the tagged union, with the following seman-
tics: If the object in question has a property labeled                                                     In schema extraction, the detection of tagged unions
                                                                                                        has so far received limited attention: Several approaches
                                                                                                        are able to detect union types, i.e., properties whose type
DEco@VLDB’22: First International Workshop on Data Ecosystems,
September 5, 2022, Sydney, Australia
                                                                                                        is a union of types, but they do not detect the depen-
$ stefan.klessinger@uni-passau.de (S. Klessinger);                                                      dency w.r.t. a specific property value (the “tag” in the
meike.klettke@ur.de (M. Klettke); uta.stoerl@fernuni-hagen.de                                           tagged union). Figure 3a shows an example: This schema
(U. Störl); stefanie.scherzinger@uni-passau.de (S. Scherzinger)                                         for GeoJSON data does not restrict the values for prop-
 0000-0003-0551-8389 (M. Klettke); 0000-0003-2771-142X                                                 erty type (it allows all strings) and allows alternative
(U. Störl)
           © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License subschemas for property coordinates. As it is overly
    CEUR
           Attribution 4.0 International (CC BY 4.0).
           CEUR Workshop Proceedings (CEUR-WS.org)
    Workshop
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073
                                                                                                        general, the schema allows to encode lines as mere ar-


                                                                                            27
 1 " if ": {                                                                 by state-of-the-art third-party tools.
 2    " properties ": {                                                    • Our experiments on real-world GeoJSON datasets
 3       " type ": { " const ": " Point " } } ,
 4    " required ": [ " type " ] } ,                                         (and its sibling format TopoJSON) show that mean-
 5 " then ": {
 6    " properties ": {
                                                                             ingful tagged unions can indeed be identified. We
 7       " coordinates ": {                                                  illustrate the impact of a configurable threshold
 8          " type ": " array " ,
 9          " items ": { " type ": " number " } } } } ,
                                                                             on the number of tagged unions detected, and
10 " else ": {                                                               consequently, the size of the extracted schema.
11    " if ": {
12       " properties ": {                                               We further outline promising directions of future work.
13          " type ": { " const ": " LineString " } } ,
14       " required ": [ " type " ] } ,
15    " then ": {                                                      Artifact availability and reproducibility. We have
16       " properties ": {
17          " coordinates ": {
                                                                       made our research artifacts (code, data, extracted schemas),
18             " type ": " array " ,                                   as well as a fully automated reproduction package, long-
19             " items ": {
20                " type ": " array " ,
                                                                       term available online [12].
21                " items ": { " type ": " number " } } } } } }

                                                                       Structure of this paper. In Section 2, we introduce
Figure 2: JSON Schema snippet declaring a tagged union for             the preliminaries. Section 3 discusses related work. In
GeoJSON geometry objects Point and LineString.                         Section 4, we present our approach, specifically the archi-
                                                                       tecture and heuristics employed, and an outlook on future
                                                                       extensions. In Section 5, our experiments are presented
rays of numbers (rather than arrays of number-arrays),                 and discussed. In Section 6, we draw our conclusions and
in violation of the GeoJSON specification.                             discuss opportunities for future work.
   Existing approaches to JSON Schema extraction fail
to detect tagged unions. Notably, in describing their
approach to schema extraction based on typing JSON
                                                                       2. Preliminaries
values, Baazizi et al. [2] outline how their approach can
                                                                       2.1. JSON Data Model and JSON Schema
be extended to include tagged unions. However, they do
not discuss strategies against overfitting to the input data       JSON data model.
during the discovery of tagged unions.                             The grammar below (adopted from [13]) captures the syn-
   Our proposal follows a different approach, and relies           tax of JSON values, namely basic values, objects, or arrays.
on a relational encoding of JSON objects from which we             Basic values 𝐵 include the null value, Booleans, num-
then derive conditional dependencies. A central part of            bers 𝑛, and strings 𝑠. Objects 𝑂 represent sets of mem-
                                                                   bers, each member being a name-value pair, arrays 𝐴
our contribution are our heuristics, which filter out de-          represent sequences of values.
pendencies that have insufficient support in the input
data, so that they are not reflected in the derived schema.            𝐽 ::= 𝐵 | 𝑂 | 𝐴
Specifically, ours is the first proposal towards the discov-           𝐵 ::= null | true | false | 𝑛 | 𝑠    𝑛 ∈ Num, 𝑠 ∈ Str
ery of tagged unions that — to our knowledge — includes                𝑂 ::= {𝑙1 : 𝐽1 , . . . , 𝑙𝑛 : 𝐽𝑛 }   𝑛 ≥ 0, 𝑖 ̸= 𝑗 ⇒ 𝑙𝑖 ̸= 𝑙𝑗
                                                                       𝐴 ::= [𝐽1 , . . . , 𝐽𝑛 ]             𝑛≥0
an experimental evaluation over real-world data.

                                                                   JSON Schema. JSON Schema is a language for defin-
Contributions. This paper makes the following con-
                                                                   ing the structure of JSON documents. The syntax and
tributions:
                                                                   semantics of JSON Schema have been formalized in [14],
     • We target the detection of tagged unions in JSON
                                                                   and we informally present some of the main keywords:
        schema extraction, specifically, tagged unions
                                                                      Assertions include required, enum, const, pattern
        that are based on dependencies between a prop-
                                                                   and type, and indicate a test that is performed on the
        erty value and the implied subschema of its sib-
                                                                   corresponding instance.
        ling properties.
                                                                      Applicators include the Boolean operators not, anyOf,
     • Our approach relies on the discovery of unary
                                                                   allOf, oneOf, as well as, if-then-else. They fur-
        constant conditional functional dependencies in
                                                                   ther include the object operators patternProperties,
        a relational encoding of the JSON objects. Tradi-
                                                                   additionalProperties, and properties, and the ar-
        tionally, conditional functional dependencies are
                                                                   ray operator items, and the reference operator $ref.
        employed in the context of data cleaning, and we
                                                                   They apply a different operator to the same instance or
        apply them to a new domain.
                                                                   to a component of the current instance.
     • Our approach is composable with existing algo-
                                                                      Annotations (title, description, and $comment)
        rithms for JSON schema extraction, as we impose
                                                                   do not affect validation but indicate an annotation asso-
        the tagged unions on top of the schemas derived
                                                                   ciated to the instance.


                                                                  28
   1 " anyOf ": [                                                 1 " anyOf ": [
   2    { " type ": " object " ,                                  2    { " type ": " object " ,
   3       " properties ": {                                      3       " properties ": {
   4          "type": { "type": "string" },                       4          "type": { "const": "Point" },
   5          " coordinates ": {                                  5          " coordinates ": {
   6             " type ": " array " ,                            6             " type ": " array " ,
   7             " items ": { " type ": " number " } } }          7             " items ": { " type ": " number " } } }
   8    },                                                        8    },
   9    { " type ": " object " ,                                  9    { " type ": " object " ,
  10       " properties ": {                                     10       " properties ": {
  11          "type": { "type": "string" },                      11          "type": { "const": "LineString" },
  12          " coordinates ": {                                 12          " coordinates ": {
  13             " type ": " array " ,                           13             " type ": " array " ,
  14             " items ": { " type ": " array " ,              14             " items ": { " type ": " array " ,
  15                " items ": { " type ": " number " }          15                " items ": { " type ": " number " }
  16     } } } } ]                                               16     } } } } ]


                          (a) Union type.                                               (b) Tagged union.
Figure 3: Snippets of different encodings of GeoJSON geometries in JSON Schema. Left: Union type encoding, as extracted by
traditional schema extraction tools (with syntactic variations, depending on the tool). Right: Tagged union encoding using
anyOf, as an alternative to the if-then-else construct. While near-identical in syntax, differing only in lines 4 and 11, the
schema semantics differ, as the union type allows incorrect encodings of points or line strings in GeoJSON.


Union types A property with several possible types             “Point” with coordinates encoded as an array of numbers
(or subschemas) can be described as a union type [15],         or a “LineString” with an array of points.
i.e., the union of all types the property assumes. In JSON        Note that while Figures 3a and 3b differ only marginally
Schema, this can be encoded by a disjunction, using the        in their syntax, the difference in semantics is striking:
union operators anyOf or oneOf (where the former is            The tagged union captures the dependency between the
inclusive, and the latter exclusive). Such an encoding is      value of property type and the subschema of property
exemplified in Figure 3a for GeoJSON. Union types are          coordinates. While union type encodings can be de-
recognized by most of the existing tools for JSON Schema       rived by several state-of-the-art schema extraction tools,
extraction (as we will discuss in greater detail in our dis-   schemas with tagged unions are – so far – manually
cussion of related work). While the schema distinguishes       crafted, since existing approaches to schema extraction
two variants for encoding coordinates (an array of             are not capable of discovering value-based dependencies.
numbers, or an array thereof), it does not capture any            Our approach can produce either the if-then-else
dependencies between the value of property type and            encoding or the encoding exemplified in Figure 3b. We
the subschema for the sibling property coordinates.            chose to implement the former, because it is more lenient
Also, the domain of property type is not restricted to         w.r.t. unexpected tag values (in this case no restrictions
specific string values.                                        are specified). Note that this is not a limitation of our
                                                               approach, and a merely technical limitation.
Tagged unions The if-then-else operator allows
for declaring tagged unions, and was introduced (rather          2.2. Dependencies
recently) with JSON Schema Draft 7.
   In Example 1, we informally introduced the semantics.       For the relational model, the concept of functional depen-
As for terminology, we distinguish one property as the         dencies is well explored, and various generalizations are
tag (in our example, the property labeled type), and           known, such as conditional functional dependencies [16]
identify one or more properties with implied subschemas        which only apply to a subset of the tuples. In the follow-
(in our example, for coordinates).                             ing, we extend these notions to JSON data, assuming a
   The tag-property may go by any name. While the              relational encoding of all objects that are reachable by
schemas for GitHub Issue Forms and cloudify from Ex-           the same path from the document root. This idea of a
ample 2 incidentally also rely on a tag labeled “type”, in     relational encoding of semi-structured data for the defini-
the example of Minecraft (see data snippets in Figure 5),      tion or detection of dependencies is a common approach,
the tag-property is labeled “condition”.                       e.g., for XML [17], JSON [18], or RDF data [19].
   An alternative encoding for tagged unions is to use a
union operator, as shown in Figure 3b. Two different com-        Relational encoding Given a JSON value, we consider
binations of values for property type and subschemas             all labeled paths from the root to a JSON object. Paths
for property coordinates are given in accordance with            may be encoded in JSONPath [20], a straightforward path
the GeoJSON definition. Again, an object may either be a         language for JSON.


                                                            29
   We next introduce the schema for our relational en-           Dependencies We next introduce dependencies over
coding. We reserve attribute 𝑂.id for an (internal) object       this relational encoding. Traditional functional depen-
identifier. The object identifier must be unique, but we         dencies (FDs) capture constraints that hold on all tuples
do not impose any constraints on its semantics. In the           of a relation. Moreover, conditional functional dependen-
following, we will simply use the line of code in the file       cies (CFDs) [16] are functional dependencies that hold
containing the JSON Schema declaration (after pretty             on only a subset of the tuples. While in their full gen-
printing), where the scope of the object is first entered.       erality, CFDs are a generalization of classical functional
   We identify the labels of all properties reachable by a       dependencies, we will focus on a very restricted subclass
given path 𝑝:                                                    that is related to association rules [21], and that can be
                                                                 defined quite compactly.
   𝐿𝑝 = {𝑙𝑖 | object {𝑙1 : 𝐽1 , . . . , 𝑙𝑛 : 𝐽𝑛 }                Definition 4. If 𝒜 is a set of attributes, then a unary con-
                                  is reachable by path 𝑝}        stant conditional functional dependency over 𝒜 (ucCFD)
                                                                 is an expression of the form
  For each property label 𝑙𝑖 in 𝐿𝑝 where 𝐽𝑖 occurs as
a basic value, we define a relational attribute 𝑙𝑖 .value                           [𝐴 = 𝑎] → [𝐵 = 𝑏]
that captures the basic value 𝐽𝑖 . These properties are
considered to be candidates for tags.                          where 𝐴, 𝐵 are attributes in 𝒜 and 𝑎, 𝑏 are constants
  Further, for each property label 𝑙𝑖 in 𝐿𝑝 , we define        from the domains of 𝐴 and 𝐵 respectively. A relation
an attribute 𝑙𝑖 .type, capturing the subschema directly        𝑅 over 𝒜 satisfies [𝐴 = 𝑎] → [𝐵 = 𝑏] if for each
derived from its value 𝐽𝑖 .                                    pair of tuples 𝑠, 𝑡 ∈ 𝑅, 𝜋𝐴 (𝑠) = 𝜋𝐴 (𝑡) = 𝑎 implies
  For each object reachable by path 𝑝, we then insert          𝜋𝐵 (𝑠) = 𝜋𝐵 (𝑡) = 𝑏.
one tuple into this relation, choosing some unique object        Example 5. The dependency below holds in Table 1 and
identifier for each object. The attribute values are pop-        reads as follows:
ulated with the semantics described above; null values
mark missing entries.                                             [type.value = "Point"] → [coordinates.type = t1]
Example 3. Table 1 shows the encoding for the JSON
                                                                    The left-hand-side, in brackets, declares a condition
objects from Figure 1 in the array starting from line 3,
                                                                 that must be satisfied for the dependency to hold: We con-
reachable by the JSONPath /geometries[*]. By t1
                                                                 sider all attributes where the value of attribute type.value
and t2, we abbreviate the subschemas directly derived
                                                                 is the string “Point”. For all tuples where this condition is
from the JSON values, requiring an array of numbers (t1)
                                                                 satisfied, the value of attribute coordinates.type must
and an array of arrays of numbers (t2), namely
                                                                 be the subschema abbreviated as t1.
       { "type" : "array",                                         In our domain of application, namely JSON values,
                                                        (t1)    these dependencies express powerful constraints between
           "items": { "type":"number" }}
                                                                property values and subschemas: In the example depen-
and further                                                     dency above, it is implied that if the value of property
                                                                type is the string constant “Point”, then the value of
   { "type": "array", "items":{                                 the sibling property coordinates must conform to the
        "type": "array",                                   (t2) subschema t1.
        "items": { "type": "number" }}}                            In the remainder of this article, we exclusively focus on
                                                                the discovery of such value-type constraints, where the
                                                                attribute on the left-hand-side of a ucCFD is of the form
                                                                “𝐴.value” (the value of property 𝐴, which we consider
Table 1
                                                                to be a candidate for a tag in a tagged union), and the
Relational encoding for the objects in Figure 1 that are reach-
able by path /geometries[*] (JSONPath syntax). Sub- attribute on the right-hand-side is of the form “𝐵.type”
schemas t1 and t2 abbreviated as in Example 3.                  (the subschema of the sibling property 𝐵).
                                                                   Traditionally, tagged unions are declared by switch-
  𝑂.id    type.value     type.type      coordinates.type        ing on the value of a single tag property. This is also
      3 Point            string         t1                      the recommended practice in JSON Schema [9], and in
      5 Point            string         t1                      agreement with what we observe in real-world data [10].
      7 LineString       string         t2                      We are therefore confident that our restriction to unary
      9 LineString       string         t2                      dependencies is justified.


                                                            30
3. Related Work                                               (quantitative comparison). Several of the examined ap-
                                                              proaches also support the extraction of union types: [1]
Our article builds upon the rich body of related work         and [31] use the JSON Schema keyword oneOf, while [2]
in the area of schema extraction and the theory of data       use the union type constructor of their own proprietary
dependencies.                                                 schema language. The authors of [32] encode the ex-
                                                              tracted schema in the XML Schema language, and encode
Schema and constraint definition. Schema lan-                 union types using entity versioning, while [33] pursue
guages for semi-structured data are well-researched.          an alternative approach of reducing different types to
   XML was developed as a semi-structured data format         their most generic type.
with implicit structural information and an optional ex-         But neither Klettke et al. [1] nor Frozza et al. [31]
plicit schema. The simplest schema language for XML is        support tagged unions within their extraction of union
DTD (Document Type Definition). The lack of means to          types. The approach by Baazizi et al. [2] is based on type
express data types and exact cardinalities in DTDs moti-      inference. They achieve scalability by inferring types in
vated the development of further schema languages, such       a MapReduce-based approach. The authors do discuss
as XML Schema [22], Schematron [23], and RelaxNG [24].        the challenge of extracting tagged unions, and describe
All three support the definition of constraints. In XML       an extension to their algorithm to address this challenge.
Schema, alternatives can be defined by specifying con-        However, there is no implementation or evaluation of
ditions on path expressions (XPath). In version 1.1 of        this feature, or of any heuristics to prevent overfitting in
XML Schema, the concept of assertions allows to encode        tagged unions.
constructs such as tagged unions. Schematron can define          In a further, recent contribution, Spoth et al. [3] focus
rules with context information (XPath) and messages           on resolving ambiguities during schema extraction, such
(that are sent in the success or error case). RelaxNG en-     as sets encoded as objects rather than arrays. As their
ables the definition of inference rules. This historical      approach does not consider property values, it cannot
excursion shows the necessity of exact and expressive         detect tagged unions.
schema languages. The same holds true for the descrip-           Durner et al. [34] recently presented an approach for
tion of JSON data. The preliminaries on JSON Schema           fast analytics on semi-structured data. Using various
were already covered in paragraph 2.1.                        algorithms, the JSON data is divided into tiles, and lo-
                                                              cal schemas are extracted. However, tagged unions are
Schema extraction. In relational databases which fol-         neither considered nor recognized.
low a schema-first approach, all available databases have        A completely different approach for schema inference
an explicit schema stored in the databases catalog. With      has been suggested in [35]. In this work, a supervised
semi-structured data, on the other hand, there are many       learning method (based on the well-known C4.5 classifi-
datasets available that have been published without such      cation algorithm) is used for detecting hidden rules in the
explicit schema information. Schema extraction (also          different variants of datasets. These rules can be either
known as reverse engineering) is therefore an important       structure-based or value-based. An interesting observa-
subtask in data profiling for semi-structured data.           tion of this work was that based on an empirical study
                                                              (interviews), value-based rules are considered more im-
XML schema extraction. For schema extraction from             portant by human consumers than structure-based rules
XML documents, different approaches have been devel-          to distinguish variants. The result of the approach is a
oped, (e.g. [25, 26, 27, 28, 29, 30]). In all algorithms, a   decision tree that distinguishes the different schema vari-
simple schema consisting of element and attribute names,      ants with value-based or structure-based conditions on
nesting, optional and required information is derived. We     each edge. Even if the approach in the paper is very differ-
are not aware of any approaches to extracting complex         ent from ours, we recognize the motivation that consid-
schema constraints, such as tagged unions.                    ering value-based conditions in heterogeneous databases
                                                              is important.
JSON schema extraction. Early work on schema ex-
traction from JSON data [1] adds — besides the schema Dependencies. For an overview of data dependencies,
itself — also the extraction of statistics and a detection of we refer to a comprehensive survey [36], and focus on
outliers. In [6], the extraction of schema versions over functional, inclusion, and conditional functional depen-
time, as well as evolution operations mapping between dencies in the following.
consecutive schema versions, is presented.
   Recent surveys of different schema extraction approach- Functional dependencies and inclusion dependen-
es were provided by [4] (qualitative comparison) and [5] cies. Functional dependencies [37, 38, 39, 18] and in-
                                                              clusion dependencies on semi-structured data [40, 18]


                                                          31
define semantic constraints on data to guarantee certain       approach. In previous work on inferring inclusion de-
data characteristics, to normalize data [41], and to ensure    pendencies from JSON data [6], we suggested an efficient
data quality. These constraints can also be applied in         method that uses lattice characteristics to optimize the
data cleaning. In [42], an overview over different meth-       algorithm. To consider outliers, a threshold is introduced.
ods and the semantic constraints, rules, or patterns to        With it, it is possible to derive related inclusion depen-
detect errors in the data cleaning process is given. Schel-    dencies which are violated by a small number of outliers.
ter et al. [43] use declarative rules in unit tests to check      A further relaxation of functional dependency discov-
the data quality (defined with several metrics). Some ex-      ery has been suggested in [49] for data lakes. Here, out-
amples are: completeness of datasets, range conditions,        liers are not ignored, but properties that have different
certain cardinalities and constraints on statistics. Both      yet similar labels are combined. This method can be used
approaches do not infer the semantic constraints from          for data exploration and profiling of datasets with lower
data but show how declarative rules can be leveraged           data quality.
during data preprocessing.
    The discovery of valid uniqueness constraints, func-        Conditional functional dependencies. Conditional
tional dependencies and inclusion dependencies is a well-       functional dependencies [16] were first introduced for
studied field in relational databases [44, 45, 46, 47, 48]      relational data, in the context of data cleaning. We apply
and even more relevant in JSON data (or, more generally,        conditional functional dependencies in a new context,
NoSQL data) because often these semantic constraints            the relational encoding of JSON objects.
are not predefined in the NoSQL databases. They are im-            In [50, 51], three algorithms for CFD mining are pro-
plicitly available in the data but for data cleaning tasks,     posed and evaluated: CTANE and FastCFD build upon
this information is required in form of rules and con-          existing algorithms for FD discovery, TANE and FastFD,
straints. Hence, development of algorithms to derive            respectively and are intended for general CFD discov-
explicit semantic constraints from NoSQL data is of par-        ery. The third algorithm is custom-designed for constant
ticular importance.                                             CFDs, as also targeted by us.
    Methods for deriving semantic constraints from data            In [52], additional rules for pruning the search space
can build upon the algorithms developed for relational          and consequently speeding up the mining process are
databases. Additionally, the algorithms have to scale           proposed for constant CFDs. Further approaches, based
with large volumes of data, and be robust despite the           on FD discovery and pattern mining, are explored in [21].
heterogeneity of datasets (variety) and low data quality
(outliers in datasets).
    Arenas and Libkin define functional and inclusion de-       4. Approach
pendencies for XML [17], as a basis for schema normal-
ization. More recently, Mior [18] targets the mining of        Our end-to-end approach is sketched in Figure 4. In our
functional and inclusion dependencies from JSON data.          upcoming walk-through of the architecture, we first focus
Both formalisms do not allow to capture the conditional        on the basic approach, as well as our heuristics against
functional dependencies required in our context.               overfitting and schema bloat. We then outline future
    However, Mior compares the performance of depen-           work extensions to robustly generalize our approach to
dency discovery in a relational encoding of the JSON           a larger family of tagged unions.
data (termed “static unrolling”) with a dynamic unrolling
technique, showing that the latter has superior runtime         4.1. Architecture
performance. Mior focuses on the discovery of functional
                                                               Composability. We design our approach to be compos-
dependencies and inclusion dependencies. While his ap-
                                                               able with existing schema extraction algorithms: Given a
proach does not consider the special case of conditional
                                                               collection of JSON values, we extract a JSON Schema de-
functional dependencies, which is relevant in our context,
                                                               scription 𝑆, using any third-party tool. Along the process
it is capable of deriving approximate dependencies, thus
                                                               sketched along the bottom, we further derive the declara-
being more robust w.r.t. vagueness in the input data.
                                                               tion of tagged unions, denoted 𝑇 , a sequence of (possibly
    Kruse et al. developed in [40] the scalable discovery of
                                                               nested) JSON Schema if-then-else statements.
inclusion dependencies from data. Scalability of the ap-
                                                                  By the conjunction of the subschemas as
proach is achieved by two reasons: the approach is only
                                                               {"allOf": [S,𝑇 ]}, the composite schema requires the
concentrating on discovery of unary inclusion depen-
                                                               JSON values to satisfy both subschemas, imposing tagged
dencies (consisting of one attribute on the left-hand and
                                                               unions on schema 𝑆. This approach allows us to focus
on the right-hand side) and is developing a distributed
                                                               on the particular challenge of identifying tagged unions,
approach. Combining both, a very efficient algorithm
                                                               while conveniently leveraging state-of-the-art schema
can be developed for analyzing large datasets. The het-
                                                               extraction tools for describing the other aspects of the
erogeneity of datasets has not been considered in this


                                                           32
                   Schema 𝑆                                                                                             JSON
  JSON                                                                                                          8
                                                                                                                        Schema
  Schema
                    Parse                  Comp.            Position                                                    {allOf:[𝑆 ,𝑇 ]}
                                        Records of Rel.       List               CFDs               if-then-else 7
                    Tree
        1                                 Encoding          Indices           Candidate List        Constraints 𝑇 CFDs List
                                              XYZ           𝜋X ={. . . }      [X=v1 ]→[Y=t1 ]         [X=v1 ]→[Y=t1 ]
    JSON                                      · · ·         𝜋Y ={. . . } 5
             2       A      B      C     3    · · · 4
                                                                              [X=v2 ]→[Y=t2 ]   6     [X=v2 ]→[Y=t2 ]

                                              · · ·         𝜋Z ={. . . }      [X=v3 ]→[Z=t3 ]         [X=v3 ]→[Z=t3 ]
Schema Extrac-
                   ...
                   ...
                   ...
                   ...
                   ...
                   ...
                   ...
                   ...
                   ...
  tion, e.g. [1]                                          Dependency Discovery


Figure 4: System architecture overview. (1) A third-party tool is used to extract a JSON Schema description 𝑆 of JSON input
data. Steps (2) through (7) visualize the discovery of tagged unions as schema 𝑇 , and are described in Section 4. In step (8), the
schemas 𝑆 and 𝑇 are composed into a composite schema. Boxed areas capture state-of-the-art algorithms integrated in our
architecture. A, B and C are JSON property labels while X, Y and Z are attributes in our relation encoding (e.g., A.value).


schema. We leave it to future work to recursively merge     implied subschema. Recall that with the dot-notation, we
the composite schema, thereby improving the readability     distinguish the value of JSON property 𝐴 (i.e., 𝐴.value)
as well as the locality of the interacting constraints.     from the subschema of sibling property 𝐵 (i.e., 𝐵.type)
                                                            in our relational encoding of JSON objects.
Relational encoding. In a first step towards the dis- Example 7. From the relational encoding in Table 1, we
covery of tagged unions, we parse all JSON values into a derive the following dependencies, again using the abbre-
single parse tree (adding a virtual root node), and deter- viations t1 and t2 for the subschemas introduced earlier:
mine the set of all labeled paths from the root to a JSON           [type.value = "Point"] → [coordinates.type = t1]
object. For each such path, we then derive the relational     [type.value = "LineString"] → [coordinates.type = t2]
encoding of all properties in the reachable objects. We
have already introduced this data structure in Section 2,      Recall that in the context of our application, they are
and resume the discussion of our running example.           interpreted as follows. Given a JSON object reachable
                                                            by the path /geometries[*], if the value of property
Example 6. Consider the GeoJSON value in Figure 1. type is “Point”, the value of the sibling property coordi-
Table 1 shows the relational encoding of the four objects nates must conform to the subschema t1 (an array of
reachable by the path /geometries[*]. The role of numbers). If the value of property type is “LineString”,
the unique object identifier is to preserve cardinalities then the value of sibling property coordinates must
in repeating attribute values, allowing us to later filter conform to the subschema t2 (an array of points).
dependencies based on their support in the input.
                                                               In the discovery of these dependencies, we leverage
   The running example has been designed to be simple, state-of-the-art algorithms [53, 47, 54, 55] (originally de-
and it ignores challenges such as properties with varying veloped for the discovery of traditional functional de-
types, dealing with several properties whose values have pendencies), such as a compressed-records encoding of
basic types, or null-values in the relational encoding. Our the relation, computing position list indexes (PLIs) and
approach can be extended to robustly handle these cases producing a sequence of candidate dependencies. PLIs
as well, and we refer to Section 4.3 for a discussion on are calculated for each column of the relational encoding
the most interesting future generalizations.                and allow to effectively group all rows (identified by their
                                                            row number) with the same value in the same subset.
Dependency discovery. Based on the relational en-              In the following, we outline the general idea. Inferring
coding, we discover conditional functional dependencies. conditional functional dependencies from PLIs can be
We restrict ourselves to unary constant conditional de- achieved by finding inclusions between them: given PLIs
pendencies in our relational encoding, of the form          𝜋A.value = {{1, 2, 3}, {4, 5, 6}} of attribute 𝐴.value and
                                                            𝜋B.type = {{1, 2, 3, 4}, {5}, {6}} of attribute 𝐵.type, we
              [𝐴.value = 𝑐] → [𝐵.type = 𝜎]                  receive the inclusion {1, 2, 3} ⊂ {1, 2, 3, 4}, allowing us
                                                            to infer that the values of JSON property A corresponding
where the left-hand-side denotes the value of the can-
                                                            to rows 1, 2 and 3 determine the subschemas of the sibling
didate tag, and the right-hand-side a property with an
                                                            property B.
implied subschema. Above, 𝐴 and 𝐵 are distinct prop-
                                                               More sophisticated algorithms developed for the dis-
erty labels, 𝑐 is a basic-value constant, and 𝜎 denotes the
                                                            covery of conditional functional dependencies (CFDs),


                                                               33
such as CFDMiner, CTANE and FastCFD [50], could be     Minimum Threshold 𝜋min . Especially in large data-
deployed just as well. As our current restrictions, most
                                                       sets, it is to be expected that dependencies are inferred
notably the one to unary dependencies, obfuscate many  that have no real semantics, despite us ignoring attributes
of the challenges in FD and CFD discovery, we opted forwith unique values. Our implementation has a config-
our simpler approach.                                  urable minimum threshold: Dependencies with insuffi-
                                                       cient support in the input are then ignored during the
Pruning candidate dependencies. To avoid over- generation of if-then-else statements.
fitting in the derived schemas, we heuristically prune    By allowing to configure this threshold, either as an
candidate dependencies, as described in the upcoming absolute or relative value, the number of discovered de-
Section 4.2.                                           pendencies can be influenced. Our experiments evaluate
                                                       this effect.
Finalization. In a final step, the conditional functional
dependencies are transformed into nested if-then-else 4.3. Future Generalizations of the
statements in JSON Schema 𝑇 . This step is merely tech-              Approach
nical, as is the construction of the composite schema.
                                                              In GeoJSON data, which served as our motivation for
                                                              the discovery of tagged unions, the geometry objects
4.2. Heuristics                                               are highly regular. When inspecting further instances
In our approach, an inherent challenge is dealing with of tagged unions in real-world data, we encounter vari-
high numbers of discovered dependencies, which may ations that require us to generalize our approach. For
lead to overfitting the derived schema w.r.t. the input data. illustration, we again resort to an example.
We next distinguish default heuristics and heuristics that
are configurable by an expert user.                             1 [ { " condition ": " minecraft : time_check " ,
                                                                2     " period ": 24000 ,
                                                                3     " value ": { " min ": 0.0 } } ]
4.2.1. Default Heuristics
                                                                    (a) The value of property min (line 3) is numeric.
The following heuristics are always applied, as they pre-
vent schema bloat due to overfitting.                           1 [ { " condition ": " minecraft : weather_check " ,
                                                                2     " raining ": false ,
                                                                3     " thundering ": false } ,
Single-valued attributes. In discovering dependen-              4   { " condition ": " minecraft : time_check " ,
cies, we ignore all attributes from the relational encoding     5       " period ": 24000 ,
                                                                6       " value ": {
with a single-valued domain. In Table 1, this concerns          7          " min ": {
                                                                8             " target ": {
the attribute type.type.                                        9                " name ": "% StartSunburn " ,
                                                               10                " type ": " minecraft : fixed " } ,
                                                               11             " score ": " Daytime " ,
Unique attributes. Unique property values are rec-             12             " type ": " minecraft : score " } } } ]
ognized as conditional functional dependencies, and ul-
timately, cause overfitting and schema bloat. In depen-      (b) The value of property min (line 7) is of type object.
dency discovery, we therefore ignore all attributes where
the domain consists of unique values (such as the object Figure 5: Minecraft JSON data (snippets edited and short-
                                                          ened, clickable links to GitHub source in the subcaptions).
identifier 𝑂.id, which does not appear in the input data,
                                                          Property condition functions as the tag in the tagged union.
and need not be considered in dependency discovery).

Union rule. We apply the union rule to the discovered          Example 8. We consider JSON data used in the Minecraft
dependencies, so that one tag may imply the subschemas         game, as shown in Figure 5. The property condition
for several sibling properties. This improves schema           functions as a tag, and distinguishes notions such as “time
succinctness.                                                  checks” and “weather checks”, among others. Depending
                                                               on the tag value, different sibling properties exist: Given
4.2.2. Configurable Heuristics                                 that the tag value is a “weather check”, sibling properties
                                                               labeled raining and thundering exist. Given that the
The following heuristics is reasonable in many cases. As
                                                               tag value is a “time check”, sibling properties labeled
it is not universally applicable, we make it configurable.
                                                               period and value exist.
                                                                  To extend our approach to such more general forms
                                                               of tagged unions, we need to be able to detect condi-
                                                               tional dependencies between the value of a candidate tag,


                                                          34
and the existence of a sibling property (which we term           as a fully automated reproduction package, are avail-
value-label constraints, to distinguish from the value-          able online [12]. Additionally, in the PDF version of this
type constraints that we already handle). Here, a specific       article, clickable links in Table 2 allow our readers to di-
challenge is to handle optional sibling properties, which        rectly inspect the input datasets, as well as the extracted
are not always present.                                          schemas.
Example 9. Resuming the previous example of a Minecraft
“time check”, the subschema for property value is an            Datasets. We consider the five JSON datasets listed in
object with a property min. The subschema of property           Table 2, all in the GeoJSON or the related TopoJSON for-
min, however, may vary. In the top example, it is a nu-         mat. For simplicity, we will commonly treat the formats
meric value. In the bottom example, it is a nested object.      GeoJSON and TopoJSON synonymously, in discussing
Our approach outlined so far cannot detect this condi-          our results.
tional functional dependency, because the subschemas               We identified these datasets by manually searching
of the sibling property differ.                                 open data servers, as well as by performing a search over
                                                                all open-source licensed repositories on GitHub using
   To detect value-type dependencies where the types on
                                                                Google BigQuery. Our selection criteria were (1) a suf-
the right-hand-side may vary, we need to relax the rec-
                                                                ficiently large document size and (2) that at least two
ognized subschema. In our current prototype implemen-
                                                                different types of geometries co-occur, so that we may
tation, we introduce attributes min.type1 , . . . , min.type𝑘
                                                                be able to infer tagged unions.
(e.g., for a fixed limit 𝑘 = 6), where we describe the
                                                                   We briefly describe the datasets:
subschema with a varying degree of detail. For instance,
                                                                      • Map Germany is a TopoJSON dataset, describ-
from a shallow min.type1 = {"type": "object} to
                                                                         ing a map of Germany and surrounding areas,
a very fine-grained description that fully captures all
                                                                         including cities, urban areas, and rivers. This
nested objects. In pruning candidate dependencies, we
                                                                         dataset uses geometry types Point, LineString,
only keep the most detailed subschema for which a CFD
                                                                         MultiLineString, Polygon and MultiPolygon.
still holds. This allows us to detect a common subschema
                                                                      • Map EU is a TopoJSON dataset. Its structure is
in many practical cases.
                                                                         comparable to Map Germany but describes coun-
                                                                         tries in the EU in detail and contains the borders
5. Experiments                                                           of some adjacent countries.
                                                                      • WorldGen Berlin is a GeoJSON dataset from Open-
We next describe our experiments using real-world geo-                   StreetMap.
spatial datasets encoded in JSON. For notational simplic-             • US House Members lists members of the US House
ity, we will write “CFDs” when we refer to the subfamily                 of representatives, including their place of birth,
of unary constant conditional functional dependencies,                   hometown and congressional district, encoded in
as introduced in Section 2.                                              GeoJSON.
                                                                      • Wildlife Sites contains wildlife sites in West York-
5.1. Setup                                                               shire, described either as a GeoJSON Polygon or
                                                                         MultiPolygon.
Implementation. Our prototype is implemented in                    The datasets encode GeoJSON or the related TopoJSON
Python, using the library anytree for managing parse            format (distinguished as formats “T” and “G” in the table).
trees. For schema validation, we use the Python library         We list the size of each dataset in lines of code after pretty-
jsonschema (clickable links in the PDF).                        printing (column |𝐷|).
   For the “third-party” schema extraction tool, we em-            In lieu of a ground truth describing the number of con-
ploy a tool that we have built in earlier work [1]. Note        ditional functional dependencies in the data that is not
that our approach is designed to work with any other tool       GeoJSON-encoded, we report in the table the number
for JSON Schema extraction. Also, since we construct            of detectable CFDs in the GeoJSON-parts only (𝐷𝑒𝑡𝐺𝑒𝑜 ).
the composite schema by combining two subschemas,               We manually determine this value for each dataset, in-
we are not restricted to Python-based tools.                    specting all distinct types of GeoJSON objects (e.g., Point,
   To confirm the composability of our approach, we have        LineString, Polygon) on each different path. We ignore
successfully built a variant of our architecture based on       all paths where the value of the GeoJSON type does not
the schema extraction tool by Frozza et al. [31, 56] (with      vary. While these can be detected as CFDs from the data,
minor technical adaptions), but we do not report the            they are not meaningful for recognizing tagged unions,
results, since they do not contribute new insights.             and should be modeled, for instance, simply as constant
                                                                values instead.
Artifact availability and reproducibility. Our re-
search artifacts (code, data, extracted schemas), as well


                                                            35
Table 2
Summary of experimental results. Links embedded in the PDF refer to the original dataset (𝐷). We state the dataset format
(TopoJSON or GeoJSON) and report its size (|𝐷|) in lines of code (LoC), the size of the conventionally extracted schema 𝑆 (in
LoC) and the number of theoretically detectable Geo-/TopoJSON CFDs (𝐷𝑒𝑡𝐺𝑒𝑜 ). For different threshold settings 𝜋min , we
produce a subschema 𝑇 encoding the tagged unions. “Ratio𝑇 ” reports the share of 𝑇 w.r.t. the size of the composite schema.
Further, the number of CFDs initially discovered (w/o heuristics), after applying threshold 𝜋min (see Section 4.2.2), and after
applying the remaining default heuristics (see Section 4.2.1) are reported. The last column reports the result of checking
whether dataset 𝐷 is valid w.r.t. the extracted composite schema (a checkmark symbol “✓” denotes a successful check). The
checkmarks are clickable links, embedded in the PDF, to the composite schema.

Dataset D         For-       |D|       |S|    DetGeo    𝜋min      |T|      Ratio   CFDs w/o      CFDs      CFDs w/       Va-
                  mat      (LoC)     (LoC)                      (LoC)       𝑇      Heuristics     w/       𝜋min and      lid
                                                                                                 𝜋min      Heuristics
                                                         50%         126   23.6%                     11             4     ✓
Map Germany       T       181 585      405          6    35%         158   27.9%          44         14             5     ✓
                                                         15%         190   31.7%                     20             6     ✓
                                                         50%          95   20.7%                      6             3     ✓
Map EU            T       122 380      359          6    35%          95   20.7%          10          6             3     ✓
                                                         15%         127   25.9%                      8             4     ✓
                                                         50%         150   10.0%                     76             4     ✓
WorldGen B.       G        56 441    1 339        11     35%         182   11.9%       1 243        109             5     ✓
                                                         15%         278   17.1%                    184             8     ✓
                                                         50%          54    9.8%                    459             2     ✓
US House M.       G        22 745      493          2    35%          54    9.8%         719        470             2     ✓
                                                         15%          54    9.8%                    660             2     ✓
                                                         50%          38   20.5%                     18             1     ✓
Wildlife Sites    G       876 482      143          2    35%          38   20.5%         360         18             1     ✓
                                                         15%          92   38.5%                     50             3     ✓


Execution environment. Our experiments were con-        We use DetGeo as a target in discussing recall of depen-
ducted on an off-the-shelf notebook with an Intel i7- dency discovery in the GeoJSON part of the input.
1165G7 CPU with 4.7GHz and 16GB of main memory.
                                                                    5.3. Results
5.2. Experimental Design
                                                            Table 2 summarizes our results. Note that all composite
Workflow. For each data collection, we extract com-         schemas successfully validate against Draft-07, and fur-
posite schemas choosing three settings for the minimum      ther, that all JSON datasets validate against the composite
threshold 𝜋min : an aggressive setting of 50%, requiring a  schemas.
conditional functional dependency to occur in at least         Decreasing the threshold 𝜋min leads to more condi-
half of the tuples in the relational encoding of all objectstional functional dependencies being discovered. The
reachable by a given path, and more lenient settings with   share of the subschema encoding tagged unions com-
35% and 15%, respectively.                                  pared to the entire composite schema ranges between
   For each threshold setting, we generate schemas 𝑆        10% and up to approx. 38%. While the 38% share might
(from the third-party tool) and 𝑇 (encoding the tagged      appear to be large, the composite schema has less than
unions), as well as the composite schema.                   250 lines in total, compared to 800K lines of JSON input.
   Each composite schema is validated against the spec-     Thus, this dataset is highly regular in its structure, and
ification of JSON Schema Draft-07, confirming that the      the extracted schema comparatively compact.
composite schema conforms. The input datasets are fur-         For the bottom three datasets, we detect conditional
ther validated against the extracted composite schema,      functional dependencies in the hundreds, even more than
checking for any logical errors in schema extraction.       one thousand in the case of the WorldGen Berlin data.
                                                            However, applying the threshold and the other heuris-
Metrics. In our quantitative assessment, we report tics drastically reduces the dependencies towards the
schema sizes (after syntactic normalization by pretty- number of tagged unions that we expect to find (c.f. the
printing), the number of conditional functional depen- target DetGeo ). Note that there are only six distinct ge-
dencies initially discovered, and after filtering according ometries in GeoJSON which may, however, occur on
to our heuristics.                                          different paths, leading to more than six if-then-else


                                                               36
statements in the GeoJSON part.                                 listed in Example 2), as well as synthetic data, and data
   After applying heuristics and a threshold of 15%, we         not encoding any tagged unions, are required.
only miss two detectable CFDs in Map EU and three in               We do not focus on the performance of our unopti-
WorldGen Berlin. Missed CFDs can be attributed to the           mized prototype implementation, as the algorithm is
threshold being too coarse.                                     only executed once for each dataset. Of course, scal-
   Manually inspecting the final dependencies, we find          ability to larger inputs is an issue to be addressed in
one unexpected dependency for Wildlife Sites with a             future work, as CFD discovery in general is an expensive
threshold of 15%. This is the only false positive depen-        problem: Already in addressing the discovery of func-
dency across all datasets. It results in a tagged union         tional dependencies in their full generality, algorithms
being declared for a specific date, which is not seman-         have exponential runtime complexity [57]. Several algo-
tically meaningful. Further, this dependency is outside         rithms for the specific problem of CFD discovery have
the GeoJSON-encoded part of the data, explaining why            been proposed [50]. Their evaluations show that runtime
the number of detected CFDs is higher than 𝐷𝑒𝑡𝐺𝑒𝑜 .             performance depends heavily on the input, with some
Thus, we have a single case of overfitting for the datasets     algorithms scaling better with size of the dataset (i.e., the
analyzed. Overall, we observe very high precision in rec-       number of tuples) thus being suitable for large datasets,
ognizing the GeoJSON/TopoJSON dependencies resident             while others perform better with higher arity. For con-
in the data.                                                    stant CFDs, Li et al. [52] achieved promising runtime
   We do not focus on runtime evaluation, as our im-            improvements by applying custom rules for pruning the
plementation is prototypical and unoptimized. Yet to            search space.
provide a general perspective, we share our observation            With strong restrictions to the problem space, such as
that runtimes vary greatly between datasets, ranging            our restriction to unary constant CFDs in our case, we can
from roughly one second to approximately one minute.            expect reasonable runtime performance on real-world
This is a waiting time which we deem acceptable for a           inputs (which are often reasonably well-behaved).
non-interactive, irregular task.                                   In our experiments, we observed a considerable impact
   Since our approach is main-memory based, memory              of our heuristics and threshold on runtime, reducing the
is a physical limitation to the inputs that we can pro-         processing time for all datasets by up to two orders of
cess. For the datasets considered here, 16GB of RAM are         magnitude. This is promising for further optimizations.
sufficient to run our experiments.

                                                                 6. Conclusion and Outlook
5.4. Discussion
The comparison between the number of dependencies               In this article, we proposed a method to infer conditional
found with different heuristics shows the effectiveness         functional dependencies from JSON data. We use the
of these heuristics in pruning dependencies. In particu-        identified dependencies as the basis for declaring tagged
lar, this concerns the configuration knob represented by        unions in JSON schema extraction.
minimum threshold.                                                 This allows us to capture value-based constraints. In
   Our approach to setting this threshold is rather coarse,     fact, in [35] Gallinucci et al. report on expert interviews
with 50% obviously too rigid, but 15% delivering mean-          with users, regarding the users’ preferences in schemas
ingful results. In future work, we plan to investigate          over nested data. Their interviews reveal that value-
how to auto-adjust the threshold, based on statistical          based conditions have a greater influence on the differ-
distributions obtained by profiling the input.                  entiation of schema variants than structural constraints,
   An obvious threat to the generalizability of our re-         and are therefore preferred.
sults — however promising they are — is that the Geo-              In future work, our approach can be extended for the
JSON datasets are highly regular in their structure. In         extraction of further variants of dependencies, for in-
the GeoJSON-part, they contain only a small number of           stance, traditional conditional functional dependencies
detectable dependencies, between two and eleven. There-         that capture implications between atomic property val-
fore, we refrain from computing the metrics precision           ues, or dependencies where the value of the tag property
and recall, as they are easily distorted when working           implies the existence of a specific sibling property, as
with small numbers. However, the manual inspection of           discussed in greater detail in Section 4.3. This allows to
the derived dependencies, and the comparison against            recognize a larger family of tagged unions.
the target DetGeo , shows that our approach is indeed suc-         Our prototype implementation is currently main-mem-
cessful for the input datasets chosen.                          ory based, which limits the size of inputs that we can
   To counter the threat of generalizability, further experi-   handle. Making our implementation scale to larger inputs
ments over different real-world datasets (e.g., the datasets    is one of the immediate next steps. Here, we may build
                                                                upon first results by Mior [18], who shows that discover-


                                                            37
ing dependencies in a relational encoding of JSON data          [4] P. Contos, M. Svoboda, JSON schema inference
has inferior runtime performance when compared to dis-              approaches, in: Proc. ER (Workshops), 2020, pp.
covering dependencies in a streaming fashion. Also, we              173–183.
plan to consider a MapReduce-based approach to schema           [5] I. Veinhardt Latták., P. Koupil., A comparative anal-
extraction, as implemented by Baazizi et al. in [2].                ysis of JSON schema inference algorithms, in: Proc.
   Ultimately, our goal is to obtain a schema declaration           ENASE, 2022, pp. 379–386.
that human consumers consider to be comprehensive, but          [6] M. Klettke, H. Awolin, U. Störl, D. Müller,
that may also be efficiently processed programmatically.            S. Scherzinger, Uncovering the evolution history of
In order to obtain more succinct schemas, we need to                data lakes, in: Proc. Big Data, 2017, pp. 2462–2471.
resolve redundancies between the schema extracted by a          [7] M. A. Baazizi, C. Berti, D. Colazzo, G. Ghelli, C. Sar-
third-party tool and our encoding of tagged unions. This            tiani, Human-in-the-loop schema inference for
requires rewriting the composite schema based on an                 massive JSON datasets, in: Proc. EDBT, 2020, pp.
algebraic representation of JSON Schema operators, such             635–638.
as the schema algebra proposed by Attouche et al. [13].         [8] J. Namba, Enhancing JSON schema discovery by
   Further, identifying metrics that capture the quality of         uncovering hidden data, in: Proc. VLDB 2021 PhD
the extracted schemas will allow to quantitatively com-             Workshop, 2021.
pare schemas extracted by different approaches. Given           [9] M. Droettboom, Understanding JSON schema,
suitable metrics, the configuration of heuristics could             https://json-schema.org/understanding-json-
even be adjusted automatically. A possible direction is             schema/reference/conditionals.html#if-then-else,
to explore the notions of precision and recall, and the             2022. Draft 2020-12.
proxy-metric of schema entropy, as introduced by Spoth         [10] M. A. Baazizi, D. Colazzo, G. Ghelli, C. Sartiani,
et al. in [3].                                                      S. Scherzinger, An empirical study on the "usage
   A further task is to adopt a CFD inference algorithm             of not" in real-world JSON schema documents, in:
that is robust despite poor data quality and that can infer         Proc. ER, 2021, pp. 102–112.
constraints despite outliers in the data. Naturally, this      [11] A. A. Frozza, R. dos Santos Mello, Js4geo: A canon-
requires a relaxation to “soft” CFDs, a task where we may           ical JSON schema for geographic data suitable to
also build upon existing work on relational [21] and even           NoSQL databases, GeoInformatica 24 (2020) 987–
JSON data [18].                                                     1019.
   In summary, our long-term vision is to extract com-         [12] S. Klessinger, M. Klettke, U. Störl, S. Scherzinger,
prehensible and therefore human-consumable schema                   Extracting JSON Schemas with Tagged
declarations from JSON data. We believe that the detec-             Unions (Reproduction Package), 2022. URL:
tion of schema design patterns that are popular among               https://doi.org/10.5281/zenodo.6985647.
schema designers, such as tagged unions, is an important            doi:10.5281/zenodo.6985647.
building block towards realizing this vision.                  [13] L. Attouche, M. A. Baazizi, D. Colazzo, G. Ghelli,
                                                                    C. Sartiani, S. Scherzinger,          Witness genera-
Acknowledgments: This work was funded by Deut-                      tion for JSON schema, CoRR abs/2202.12849
sche Forschungsgemeinschaft (DFG, German Research                   (2022). URL: https://arxiv.org/abs/2202.12849.
Foundation) grant #385808805. We thank Thomas Kirz                  arXiv:2202.12849.
for expertly typesetting the systems architecture in LATEX.    [14] F. Pezoa, J. L. Reutter, F. Suárez, M. Ugarte, D. Vrgoc,
We thank the anonymous reviewers of DEco’22, espe-                  Foundations of JSON Schema, in: Proc. WWW,
cially reviewer #2, for the detailed and helpful feedback.          2016, pp. 263–273.
                                                               [15] F. Suárez, J. Reutter, D. Vrgoc, M. Ugarte, F. Pezoa,
                                                                    JSON Schema: Multiple Types (Software), https://
References                                                          cswr.github.io/JsonSchema/spec/multiple_types/,
                                                                    2016.
 [1] M. Klettke, U. Störl, S. Scherzinger, Schema extrac-      [16] P. Bohannon, W. Fan, F. Geerts, X. Jia, A. Kementsi-
     tion and structural outlier detection for JSON-based           etsidis, Conditional functional dependencies for
     NoSQL data stores, in: Proc. BTW, volume P-241,                data cleaning, in: Proc. ICDE, 2007, pp. 746–755.
     GI, 2015, pp. 425–444.                                    [17] M. Arenas, L. Libkin, A normal form for xml
 [2] M. A. Baazizi, D. Colazzo, G. Ghelli, C. Sartiani,             documents, ACM Trans. Database Syst. 29 (2004)
     Parametric schema inference for massive JSON                   195–232.
     datasets, VLDB J. 28 (2019) 497–521.                      [18] M. J. Mior, Fast discovery of nested dependencies on
 [3] W. Spoth, O. Kennedy, Y. Lu, B. C. Hammerschmidt,              JSON data, CoRR abs/2111.10398 (2021). URL: https:
     Z. H. Liu, Reducing Ambiguity in JSON Schema                   //arxiv.org/abs/2111.10398. arXiv:2111.10398.
     Discovery, in: Proc. SIGMOD, 2021, pp. 1732–1744.         [19] S. Kruse, A. Jentzsch, T. Papenbrock, Z. Kaoudi,


                                                          38
     J. Quiané-Ruiz, F. Naumann, Rdfind: Scalable con-        [35] E. Gallinucci, M. Golfarelli, S. Rizzi, Schema profil-
     ditional inclusion dependency discovery in RDF                ing of document-oriented databases, Inf. Syst. 75
     datasets, in: Proc. SIGMOD, 2016, pp. 953–967.                (2018) 13–25.
[20] J. Friesen, Java XML and JSON: Document Process-         [36] S. Song, F. Gao, R. Huang, C. Wang, Data depen-
     ing for Java SE, Apress, 2019, pp. 299–322.                   dencies extended for variety and veracity: A family
[21] J. Rammelaere, F. Geerts, Revisiting conditional              tree, IEEE Transactions on Knowledge and Data
     functional dependency discovery: Splitting the "C"            Engineering 34 (2022) 4717–4736.
     from the "FD", in: Proc. ECML PKDD, 2018, pp.            [37] M. L. Lee, T. W. Ling, W. L. Low, Designing func-
     552–568.                                                      tional dependencies for XML, in: Proc. EDBT, 2002,
[22] W3C, W3C XML Schema Definition Language                       pp. 124–141.
     (XSD) 1.1 Part 1: Structures, W3C Recom-                 [38] S. Hartmann, S. Link, More functional dependencies
     mendation, 2012. URL: https://www.w3.org/TR/                  for xml, in: L. Kalinichenko, R. Manthey, B. Thal-
     xmlschema11-1/.                                               heim, U. Wloka (Eds.), Advances in Databases and
[23] ISO/IEC, ISO/IEC 19757-3:2020 Information tech-               Information Systems, Springer Berlin Heidelberg,
     nology — Document Schema Definition Languages                 Berlin, Heidelberg, 2003, pp. 355–369.
     (DSDL) — Part 3: Rule-based validation — Schema-         [39] L. Kot, W. M. White, Characterization of the inter-
     tron, ISO/IEC Standard, 2020. URL: https://www.               action of XML functional dependencies with DTDs,
     iso.org/standard/74515.html.                                  in: Proc. ICDT, 2007, pp. 119–133.
[24] ISO/IEC, ISO/IEC 19757-2:2008 Information tech-          [40] S. Kruse, T. Papenbrock, F. Naumann, Scaling out
     nology — Document Schema Definition Language                  the discovery of inclusion dependencies, in: Proc.
     (DSDL) — Part 2: Regular-grammar-based valida-                BTW, 2015, pp. 445–454.
     tion — RELAX NG, ISO/IEC Standard, 2008. URL:            [41] T. Vajk, P. Fehér, K. Fekete, H. Charaf, Denormal-
     https://www.iso.org/standard/52348.html.                      izing data into schema-free databases, in: Proc.
[25] M. Garofalakis, A. Gionis, R. Rastogi, S. Seshadri,           CogInfoCom, 2013, pp. 747–752.
     K. Shim, XTRACT: A system for extracting docu-           [42] X. Chu, I. F. Ilyas, S. Krishnan, J. Wang, Data clean-
     ment type descriptors from XML documents, in:                 ing: Overview and emerging challenges, in: Proc.
     Proc. SIGMOD, 2000, pp. 165–176.                              SIGMOD, 2016, pp. 2201–2206.
[26] C.-H. Moh, E.-P. Lim, W.-K. Ng, DTD-Miner: a tool        [43] S. Schelter, D. Lange, P. Schmidt, M. Celikel, F. Biess-
     for mining DTD from XML documents, in: Proc.                  mann, A. Grafberger, Automating large-scale data
     WECWIS, 2000, pp. 144–151.                                    quality verification, Proc. VLDB Endow. 11 (2018)
[27] J. Hegewald, F. Naumann, M. Weis, Xstruct: Ef-                1781–1794.
     ficient schema extraction from multiple and large        [44] J. Kivinen, H. Mannila, Approximate inference of
     XML documents, in: Proc. Workshops ICDE, 2006.                functional dependencies from relations, Theoretical
[28] M. Klempa, M. Kozák, M. Mikula, R. Smetana,                   Computer Science 149 (1995) 129–149.
     J. Stárka, M. Svirec, M. Vitásek, M. Necaský, I. Hol-    [45] H. Mannila, K.-J. Räihä, On the complexity of in-
     ubová, jInfer: A framework for XML schema infer-              ferring functional dependencies, Discrete Applied
     ence, Comput. J. 58 (2015) 134–156.                           Mathematics 40 (1992) 237–243.
[29] B. Chidlovskii, Schema extraction from xml collec-       [46] H. Yao, H. J. Hamilton, Mining functional depen-
     tions, in: Proc. JCDL, 2002, p. 291–292.                      dencies from data, Data Mining and Knowledge
[30] I. Mlýnková, M. Nečaský, Heuristic methods for in-            Discovery 16 (2008) 197–219.
     ference of xml schemas: Lessons learned and open         [47] T. Papenbrock, F. Naumann, A hybrid approach
     issues, Informatica 24 (2013) 577–602.                        to functional dependency discovery, in: Proc. SIG-
[31] A. A. Frozza, R. dos Santos Mello, F. de Souza da             MOD, 2016, pp. 821–833.
     Costa, An approach for schema extraction of JSON         [48] J. Bauckmann, U. Leser, F. Naumann, V. Tietz, Effi-
     and extended JSON document collections, in: Proc.             ciently detecting inclusion dependencies, in: Proc.
     IRI, 2018, pp. 356–363.                                       ICDE, 2007, pp. 1448–1450.
[32] D. S. Ruiz, S. F. Morales, J. G. Molina, Inferring       [49] R. Hai, C. Quix, D. Wang, Relaxed functional de-
     versioned schemas from NoSQL databases and its                pendency discovery in heterogeneous data lakes,
     applications, in: Proc. ER, 2015, pp. 467–480.                in: Proc. ER, 2019, pp. 225–239.
[33] J. L. C. Izquierdo, J. Cabot, Discovering implicit       [50] W. Fan, F. Geerts, L. V. S. Lakshmanan, M. Xiong,
     schemas in JSON data, in: Proc. ICWE, 2013, pp.               Discovering conditional functional dependencies,
     68–83.                                                        in: Proc. ICDE, 2009, pp. 1231–1234.
[34] D. Durner, V. Leis, T. Neumann, JSON tiles: Fast an-     [51] W. Fan, F. Geerts, J. Li, M. Xiong, Discovering
     alytics on semi-structured data, in: Proc. SIGMOD,            conditional functional dependencies, IEEE Trans.
     2021, pp. 445–458.                                            Knowl. Data Eng. 23 (2011) 683–698.


                                                         39
[52] J. Li, J. Liu, H. Toivonen, J. Yong, Effective pruning
     for the discovery of conditional functional depen-
     dencies, Comput. J. 56 (2013) 378–392.
[53] I. F. Ilyas, X. Chu, Data Cleaning, ACM, 2019.
[54] P. Schirmer, T. Papenbrock, S. Kruse, F. Nau-
     mann, D. Hempfing, T. Mayer, D. Neuschäfer-Rube,
     Dynfd: Functional dependency discovery in dy-
     namic datasets, in: Proc. EDBT, 2019, pp. 253–264.
[55] Z. Wei, S. Link, Discovery and ranking of functional
     dependencies, in: Proc. ICDE, 2019, pp. 1526–1537.
[56] A. A. Frozza, R. dos Santos Mello, F. de Souza da
     Costa, An approach for schema extraction of
     JSON and extended JSON document collec-
     tions (software), https://github.com/gbd-ufsc/
     JSONSchemaDiscovery, 2018.
[57] J. Liu, J. Li, C. Liu, Y. Chen, Discover dependencies
     from data - A review, IEEE Trans. Knowl. Data Eng.
     24 (2012) 251–264.


                                                          40