Extracting JSON Schemas with tagged unions Stefan Klessinger1 , Meike Klettke2 , Uta Störl3 and Stefanie Scherzinger1 1 University of Passau, Passau, Germany 2 University of Regensburg, Regensburg, Germany 3 University of Hagen, Hagen, Germany Abstract With data lakes and schema-free NoSQL document stores, extracting a descriptive schema from JSON data collections is an acute challenge. In this paper, we target the discovery of tagged unions, a JSON Schema design pattern where the value of one property of an object (the tag) conditionally implies subschemas for sibling properties. We formalize these implications as conditional functional dependencies and capture them using the JSON Schema operators if-then-else. We further motivate our heuristics to avoid overfitting. Experiments with our prototype implementation are promising, and show that this form of tagged unions can successfully be detected in real-world GeoJSON and TopoJSON datasets. In discussing future work, we outline how our approach can be extended further. 1. Introduction 1 { " type ": " G e o m e t r y C ol l e c t i o n " , 2 " geometries ": [ 3 { " type ": " Point " , JSON is a popular data exchange format. Extracting a 4 " coordinates ": [30 ,10] } , schema from collections of JSON documents is a real- 5 { " type ": " Point " , 6 " coordinates ": [40 ,15] } , world challenge which is actively being researched [1, 2, 7 { " type ": " LineString " , 3, 4, 5, 6, 7, 8]. 8 " coordinates ": [[55 ,5] , [10 ,30] , [10 ,10]] } , 9 { " type ": " LineString " , Ideally, the extracted schema describes the data tightly, 10 " coordinates ": [[30 ,10] , [10 ,30] , [40 ,40]] } yet without overfitting. In this article, we target the de- 11 ] } tection of a specific schema design pattern in the JSON Schema language, the pattern of tagged unions, also known Figure 1: GeoJSON data. The value of property type (here as discriminated unions, labeled unions, or variant types. Point and LineString) determines the subschema of prop- This is a recommended design pattern [9], and has been erty coordinates as either encoding a geometric point (an array of numbers), or a line (an array of points). found to be quite common in real-world schemas [10]. Example 1. Figure 1 shows GeoJSON data. The array starting in line 2 holds four objects, of type Point and type with the value Point, then the value of property LineString. Point coordinates are encoded as an array coordinates must be an array of numbers. Else, if the of numbers, while lines are encoded as an array of points, and hence, an array of number arrays. property type has the value LineString, the value of coordinates must be an array of points, and hence, an The GeoJSON specification (IETF RFC 7946, clickable link embedded in the PDF) describes six types of ge- array of number arrays. ometries, including polygons and multi-polygons. Con- Example 2. Various schemas listed on SchemaStore.org, sistently, the property type serves as a tag to distin- a community-curated collection of JSON Schema dec- guish the subschema of sibling property coordinates, larations, encode tagged unions using if-then-else. thereby instantiating a tagged union. Among them are the Minecraft schemas which use tagged GeoJSON comes without an official JSON Schema spec- unions to encode so-called data packs for configuring ification [11]. In Figure 2, we therefore show a hand- Minecraft worlds. Other examples are Github Issue Forms, crafted excerpt using the if-then-else construct to or the cloudify schema (clickable links in the PDF). enforce the tagged union, with the following seman- tics: If the object in question has a property labeled In schema extraction, the detection of tagged unions has so far received limited attention: Several approaches are able to detect union types, i.e., properties whose type DEco@VLDB’22: First International Workshop on Data Ecosystems, September 5, 2022, Sydney, Australia is a union of types, but they do not detect the depen- $ stefan.klessinger@uni-passau.de (S. Klessinger); dency w.r.t. a specific property value (the “tag” in the meike.klettke@ur.de (M. Klettke); uta.stoerl@fernuni-hagen.de tagged union). Figure 3a shows an example: This schema (U. Störl); stefanie.scherzinger@uni-passau.de (S. Scherzinger) for GeoJSON data does not restrict the values for prop-  0000-0003-0551-8389 (M. Klettke); 0000-0003-2771-142X erty type (it allows all strings) and allows alternative (U. Störl) © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License subschemas for property coordinates. As it is overly CEUR Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org) Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 general, the schema allows to encode lines as mere ar- 27 1 " if ": { by state-of-the-art third-party tools. 2 " properties ": { • Our experiments on real-world GeoJSON datasets 3 " type ": { " const ": " Point " } } , 4 " required ": [ " type " ] } , (and its sibling format TopoJSON) show that mean- 5 " then ": { 6 " properties ": { ingful tagged unions can indeed be identified. We 7 " coordinates ": { illustrate the impact of a configurable threshold 8 " type ": " array " , 9 " items ": { " type ": " number " } } } } , on the number of tagged unions detected, and 10 " else ": { consequently, the size of the extracted schema. 11 " if ": { 12 " properties ": { We further outline promising directions of future work. 13 " type ": { " const ": " LineString " } } , 14 " required ": [ " type " ] } , 15 " then ": { Artifact availability and reproducibility. We have 16 " properties ": { 17 " coordinates ": { made our research artifacts (code, data, extracted schemas), 18 " type ": " array " , as well as a fully automated reproduction package, long- 19 " items ": { 20 " type ": " array " , term available online [12]. 21 " items ": { " type ": " number " } } } } } } Structure of this paper. In Section 2, we introduce Figure 2: JSON Schema snippet declaring a tagged union for the preliminaries. Section 3 discusses related work. In GeoJSON geometry objects Point and LineString. Section 4, we present our approach, specifically the archi- tecture and heuristics employed, and an outlook on future extensions. In Section 5, our experiments are presented rays of numbers (rather than arrays of number-arrays), and discussed. In Section 6, we draw our conclusions and in violation of the GeoJSON specification. discuss opportunities for future work. Existing approaches to JSON Schema extraction fail to detect tagged unions. Notably, in describing their approach to schema extraction based on typing JSON 2. Preliminaries values, Baazizi et al. [2] outline how their approach can 2.1. JSON Data Model and JSON Schema be extended to include tagged unions. However, they do not discuss strategies against overfitting to the input data JSON data model. during the discovery of tagged unions. The grammar below (adopted from [13]) captures the syn- Our proposal follows a different approach, and relies tax of JSON values, namely basic values, objects, or arrays. on a relational encoding of JSON objects from which we Basic values 𝐵 include the null value, Booleans, num- then derive conditional dependencies. A central part of bers 𝑛, and strings 𝑠. Objects 𝑂 represent sets of mem- bers, each member being a name-value pair, arrays 𝐴 our contribution are our heuristics, which filter out de- represent sequences of values. pendencies that have insufficient support in the input data, so that they are not reflected in the derived schema. 𝐽 ::= 𝐵 | 𝑂 | 𝐴 Specifically, ours is the first proposal towards the discov- 𝐵 ::= null | true | false | 𝑛 | 𝑠 𝑛 ∈ Num, 𝑠 ∈ Str ery of tagged unions that — to our knowledge — includes 𝑂 ::= {𝑙1 : 𝐽1 , . . . , 𝑙𝑛 : 𝐽𝑛 } 𝑛 ≥ 0, 𝑖 ̸= 𝑗 ⇒ 𝑙𝑖 ̸= 𝑙𝑗 𝐴 ::= [𝐽1 , . . . , 𝐽𝑛 ] 𝑛≥0 an experimental evaluation over real-world data. JSON Schema. JSON Schema is a language for defin- Contributions. This paper makes the following con- ing the structure of JSON documents. The syntax and tributions: semantics of JSON Schema have been formalized in [14], • We target the detection of tagged unions in JSON and we informally present some of the main keywords: schema extraction, specifically, tagged unions Assertions include required, enum, const, pattern that are based on dependencies between a prop- and type, and indicate a test that is performed on the erty value and the implied subschema of its sib- corresponding instance. ling properties. Applicators include the Boolean operators not, anyOf, • Our approach relies on the discovery of unary allOf, oneOf, as well as, if-then-else. They fur- constant conditional functional dependencies in ther include the object operators patternProperties, a relational encoding of the JSON objects. Tradi- additionalProperties, and properties, and the ar- tionally, conditional functional dependencies are ray operator items, and the reference operator $ref. employed in the context of data cleaning, and we They apply a different operator to the same instance or apply them to a new domain. to a component of the current instance. • Our approach is composable with existing algo- Annotations (title, description, and $comment) rithms for JSON schema extraction, as we impose do not affect validation but indicate an annotation asso- the tagged unions on top of the schemas derived ciated to the instance. 28 1 " anyOf ": [ 1 " anyOf ": [ 2 { " type ": " object " , 2 { " type ": " object " , 3 " properties ": { 3 " properties ": { 4 "type": { "type": "string" }, 4 "type": { "const": "Point" }, 5 " coordinates ": { 5 " coordinates ": { 6 " type ": " array " , 6 " type ": " array " , 7 " items ": { " type ": " number " } } } 7 " items ": { " type ": " number " } } } 8 }, 8 }, 9 { " type ": " object " , 9 { " type ": " object " , 10 " properties ": { 10 " properties ": { 11 "type": { "type": "string" }, 11 "type": { "const": "LineString" }, 12 " coordinates ": { 12 " coordinates ": { 13 " type ": " array " , 13 " type ": " array " , 14 " items ": { " type ": " array " , 14 " items ": { " type ": " array " , 15 " items ": { " type ": " number " } 15 " items ": { " type ": " number " } 16 } } } } ] 16 } } } } ] (a) Union type. (b) Tagged union. Figure 3: Snippets of different encodings of GeoJSON geometries in JSON Schema. Left: Union type encoding, as extracted by traditional schema extraction tools (with syntactic variations, depending on the tool). Right: Tagged union encoding using anyOf, as an alternative to the if-then-else construct. While near-identical in syntax, differing only in lines 4 and 11, the schema semantics differ, as the union type allows incorrect encodings of points or line strings in GeoJSON. Union types A property with several possible types “Point” with coordinates encoded as an array of numbers (or subschemas) can be described as a union type [15], or a “LineString” with an array of points. i.e., the union of all types the property assumes. In JSON Note that while Figures 3a and 3b differ only marginally Schema, this can be encoded by a disjunction, using the in their syntax, the difference in semantics is striking: union operators anyOf or oneOf (where the former is The tagged union captures the dependency between the inclusive, and the latter exclusive). Such an encoding is value of property type and the subschema of property exemplified in Figure 3a for GeoJSON. Union types are coordinates. While union type encodings can be de- recognized by most of the existing tools for JSON Schema rived by several state-of-the-art schema extraction tools, extraction (as we will discuss in greater detail in our dis- schemas with tagged unions are – so far – manually cussion of related work). While the schema distinguishes crafted, since existing approaches to schema extraction two variants for encoding coordinates (an array of are not capable of discovering value-based dependencies. numbers, or an array thereof), it does not capture any Our approach can produce either the if-then-else dependencies between the value of property type and encoding or the encoding exemplified in Figure 3b. We the subschema for the sibling property coordinates. chose to implement the former, because it is more lenient Also, the domain of property type is not restricted to w.r.t. unexpected tag values (in this case no restrictions specific string values. are specified). Note that this is not a limitation of our approach, and a merely technical limitation. Tagged unions The if-then-else operator allows for declaring tagged unions, and was introduced (rather 2.2. Dependencies recently) with JSON Schema Draft 7. In Example 1, we informally introduced the semantics. For the relational model, the concept of functional depen- As for terminology, we distinguish one property as the dencies is well explored, and various generalizations are tag (in our example, the property labeled type), and known, such as conditional functional dependencies [16] identify one or more properties with implied subschemas which only apply to a subset of the tuples. In the follow- (in our example, for coordinates). ing, we extend these notions to JSON data, assuming a The tag-property may go by any name. While the relational encoding of all objects that are reachable by schemas for GitHub Issue Forms and cloudify from Ex- the same path from the document root. This idea of a ample 2 incidentally also rely on a tag labeled “type”, in relational encoding of semi-structured data for the defini- the example of Minecraft (see data snippets in Figure 5), tion or detection of dependencies is a common approach, the tag-property is labeled “condition”. e.g., for XML [17], JSON [18], or RDF data [19]. An alternative encoding for tagged unions is to use a union operator, as shown in Figure 3b. Two different com- Relational encoding Given a JSON value, we consider binations of values for property type and subschemas all labeled paths from the root to a JSON object. Paths for property coordinates are given in accordance with may be encoded in JSONPath [20], a straightforward path the GeoJSON definition. Again, an object may either be a language for JSON. 29 We next introduce the schema for our relational en- Dependencies We next introduce dependencies over coding. We reserve attribute 𝑂.id for an (internal) object this relational encoding. Traditional functional depen- identifier. The object identifier must be unique, but we dencies (FDs) capture constraints that hold on all tuples do not impose any constraints on its semantics. In the of a relation. Moreover, conditional functional dependen- following, we will simply use the line of code in the file cies (CFDs) [16] are functional dependencies that hold containing the JSON Schema declaration (after pretty on only a subset of the tuples. While in their full gen- printing), where the scope of the object is first entered. erality, CFDs are a generalization of classical functional We identify the labels of all properties reachable by a dependencies, we will focus on a very restricted subclass given path 𝑝: that is related to association rules [21], and that can be defined quite compactly. 𝐿𝑝 = {𝑙𝑖 | object {𝑙1 : 𝐽1 , . . . , 𝑙𝑛 : 𝐽𝑛 } Definition 4. If 𝒜 is a set of attributes, then a unary con- is reachable by path 𝑝} stant conditional functional dependency over 𝒜 (ucCFD) is an expression of the form For each property label 𝑙𝑖 in 𝐿𝑝 where 𝐽𝑖 occurs as a basic value, we define a relational attribute 𝑙𝑖 .value [𝐴 = 𝑎] → [𝐵 = 𝑏] that captures the basic value 𝐽𝑖 . These properties are considered to be candidates for tags. where 𝐴, 𝐵 are attributes in 𝒜 and 𝑎, 𝑏 are constants Further, for each property label 𝑙𝑖 in 𝐿𝑝 , we define from the domains of 𝐴 and 𝐵 respectively. A relation an attribute 𝑙𝑖 .type, capturing the subschema directly 𝑅 over 𝒜 satisfies [𝐴 = 𝑎] → [𝐵 = 𝑏] if for each derived from its value 𝐽𝑖 . pair of tuples 𝑠, 𝑡 ∈ 𝑅, 𝜋𝐴 (𝑠) = 𝜋𝐴 (𝑡) = 𝑎 implies For each object reachable by path 𝑝, we then insert 𝜋𝐵 (𝑠) = 𝜋𝐵 (𝑡) = 𝑏. one tuple into this relation, choosing some unique object Example 5. The dependency below holds in Table 1 and identifier for each object. The attribute values are pop- reads as follows: ulated with the semantics described above; null values mark missing entries. [type.value = "Point"] → [coordinates.type = t1] Example 3. Table 1 shows the encoding for the JSON The left-hand-side, in brackets, declares a condition objects from Figure 1 in the array starting from line 3, that must be satisfied for the dependency to hold: We con- reachable by the JSONPath /geometries[*]. By t1 sider all attributes where the value of attribute type.value and t2, we abbreviate the subschemas directly derived is the string “Point”. For all tuples where this condition is from the JSON values, requiring an array of numbers (t1) satisfied, the value of attribute coordinates.type must and an array of arrays of numbers (t2), namely be the subschema abbreviated as t1. { "type" : "array", In our domain of application, namely JSON values, (t1) these dependencies express powerful constraints between "items": { "type":"number" }} property values and subschemas: In the example depen- and further dency above, it is implied that if the value of property type is the string constant “Point”, then the value of { "type": "array", "items":{ the sibling property coordinates must conform to the "type": "array", (t2) subschema t1. "items": { "type": "number" }}} In the remainder of this article, we exclusively focus on the discovery of such value-type constraints, where the attribute on the left-hand-side of a ucCFD is of the form “𝐴.value” (the value of property 𝐴, which we consider Table 1 to be a candidate for a tag in a tagged union), and the Relational encoding for the objects in Figure 1 that are reach- able by path /geometries[*] (JSONPath syntax). Sub- attribute on the right-hand-side is of the form “𝐵.type” schemas t1 and t2 abbreviated as in Example 3. (the subschema of the sibling property 𝐵). Traditionally, tagged unions are declared by switch- 𝑂.id type.value type.type coordinates.type ing on the value of a single tag property. This is also 3 Point string t1 the recommended practice in JSON Schema [9], and in 5 Point string t1 agreement with what we observe in real-world data [10]. 7 LineString string t2 We are therefore confident that our restriction to unary 9 LineString string t2 dependencies is justified. 30 3. Related Work (quantitative comparison). Several of the examined ap- proaches also support the extraction of union types: [1] Our article builds upon the rich body of related work and [31] use the JSON Schema keyword oneOf, while [2] in the area of schema extraction and the theory of data use the union type constructor of their own proprietary dependencies. schema language. The authors of [32] encode the ex- tracted schema in the XML Schema language, and encode Schema and constraint definition. Schema lan- union types using entity versioning, while [33] pursue guages for semi-structured data are well-researched. an alternative approach of reducing different types to XML was developed as a semi-structured data format their most generic type. with implicit structural information and an optional ex- But neither Klettke et al. [1] nor Frozza et al. [31] plicit schema. The simplest schema language for XML is support tagged unions within their extraction of union DTD (Document Type Definition). The lack of means to types. The approach by Baazizi et al. [2] is based on type express data types and exact cardinalities in DTDs moti- inference. They achieve scalability by inferring types in vated the development of further schema languages, such a MapReduce-based approach. The authors do discuss as XML Schema [22], Schematron [23], and RelaxNG [24]. the challenge of extracting tagged unions, and describe All three support the definition of constraints. In XML an extension to their algorithm to address this challenge. Schema, alternatives can be defined by specifying con- However, there is no implementation or evaluation of ditions on path expressions (XPath). In version 1.1 of this feature, or of any heuristics to prevent overfitting in XML Schema, the concept of assertions allows to encode tagged unions. constructs such as tagged unions. Schematron can define In a further, recent contribution, Spoth et al. [3] focus rules with context information (XPath) and messages on resolving ambiguities during schema extraction, such (that are sent in the success or error case). RelaxNG en- as sets encoded as objects rather than arrays. As their ables the definition of inference rules. This historical approach does not consider property values, it cannot excursion shows the necessity of exact and expressive detect tagged unions. schema languages. The same holds true for the descrip- Durner et al. [34] recently presented an approach for tion of JSON data. The preliminaries on JSON Schema fast analytics on semi-structured data. Using various were already covered in paragraph 2.1. algorithms, the JSON data is divided into tiles, and lo- cal schemas are extracted. However, tagged unions are Schema extraction. In relational databases which fol- neither considered nor recognized. low a schema-first approach, all available databases have A completely different approach for schema inference an explicit schema stored in the databases catalog. With has been suggested in [35]. In this work, a supervised semi-structured data, on the other hand, there are many learning method (based on the well-known C4.5 classifi- datasets available that have been published without such cation algorithm) is used for detecting hidden rules in the explicit schema information. Schema extraction (also different variants of datasets. These rules can be either known as reverse engineering) is therefore an important structure-based or value-based. An interesting observa- subtask in data profiling for semi-structured data. tion of this work was that based on an empirical study (interviews), value-based rules are considered more im- XML schema extraction. For schema extraction from portant by human consumers than structure-based rules XML documents, different approaches have been devel- to distinguish variants. The result of the approach is a oped, (e.g. [25, 26, 27, 28, 29, 30]). In all algorithms, a decision tree that distinguishes the different schema vari- simple schema consisting of element and attribute names, ants with value-based or structure-based conditions on nesting, optional and required information is derived. We each edge. Even if the approach in the paper is very differ- are not aware of any approaches to extracting complex ent from ours, we recognize the motivation that consid- schema constraints, such as tagged unions. ering value-based conditions in heterogeneous databases is important. JSON schema extraction. Early work on schema ex- traction from JSON data [1] adds — besides the schema Dependencies. For an overview of data dependencies, itself — also the extraction of statistics and a detection of we refer to a comprehensive survey [36], and focus on outliers. In [6], the extraction of schema versions over functional, inclusion, and conditional functional depen- time, as well as evolution operations mapping between dencies in the following. consecutive schema versions, is presented. Recent surveys of different schema extraction approach- Functional dependencies and inclusion dependen- es were provided by [4] (qualitative comparison) and [5] cies. Functional dependencies [37, 38, 39, 18] and in- clusion dependencies on semi-structured data [40, 18] 31 define semantic constraints on data to guarantee certain approach. In previous work on inferring inclusion de- data characteristics, to normalize data [41], and to ensure pendencies from JSON data [6], we suggested an efficient data quality. These constraints can also be applied in method that uses lattice characteristics to optimize the data cleaning. In [42], an overview over different meth- algorithm. To consider outliers, a threshold is introduced. ods and the semantic constraints, rules, or patterns to With it, it is possible to derive related inclusion depen- detect errors in the data cleaning process is given. Schel- dencies which are violated by a small number of outliers. ter et al. [43] use declarative rules in unit tests to check A further relaxation of functional dependency discov- the data quality (defined with several metrics). Some ex- ery has been suggested in [49] for data lakes. Here, out- amples are: completeness of datasets, range conditions, liers are not ignored, but properties that have different certain cardinalities and constraints on statistics. Both yet similar labels are combined. This method can be used approaches do not infer the semantic constraints from for data exploration and profiling of datasets with lower data but show how declarative rules can be leveraged data quality. during data preprocessing. The discovery of valid uniqueness constraints, func- Conditional functional dependencies. Conditional tional dependencies and inclusion dependencies is a well- functional dependencies [16] were first introduced for studied field in relational databases [44, 45, 46, 47, 48] relational data, in the context of data cleaning. We apply and even more relevant in JSON data (or, more generally, conditional functional dependencies in a new context, NoSQL data) because often these semantic constraints the relational encoding of JSON objects. are not predefined in the NoSQL databases. They are im- In [50, 51], three algorithms for CFD mining are pro- plicitly available in the data but for data cleaning tasks, posed and evaluated: CTANE and FastCFD build upon this information is required in form of rules and con- existing algorithms for FD discovery, TANE and FastFD, straints. Hence, development of algorithms to derive respectively and are intended for general CFD discov- explicit semantic constraints from NoSQL data is of par- ery. The third algorithm is custom-designed for constant ticular importance. CFDs, as also targeted by us. Methods for deriving semantic constraints from data In [52], additional rules for pruning the search space can build upon the algorithms developed for relational and consequently speeding up the mining process are databases. Additionally, the algorithms have to scale proposed for constant CFDs. Further approaches, based with large volumes of data, and be robust despite the on FD discovery and pattern mining, are explored in [21]. heterogeneity of datasets (variety) and low data quality (outliers in datasets). Arenas and Libkin define functional and inclusion de- 4. Approach pendencies for XML [17], as a basis for schema normal- ization. More recently, Mior [18] targets the mining of Our end-to-end approach is sketched in Figure 4. In our functional and inclusion dependencies from JSON data. upcoming walk-through of the architecture, we first focus Both formalisms do not allow to capture the conditional on the basic approach, as well as our heuristics against functional dependencies required in our context. overfitting and schema bloat. We then outline future However, Mior compares the performance of depen- work extensions to robustly generalize our approach to dency discovery in a relational encoding of the JSON a larger family of tagged unions. data (termed “static unrolling”) with a dynamic unrolling technique, showing that the latter has superior runtime 4.1. Architecture performance. Mior focuses on the discovery of functional Composability. We design our approach to be compos- dependencies and inclusion dependencies. While his ap- able with existing schema extraction algorithms: Given a proach does not consider the special case of conditional collection of JSON values, we extract a JSON Schema de- functional dependencies, which is relevant in our context, scription 𝑆, using any third-party tool. Along the process it is capable of deriving approximate dependencies, thus sketched along the bottom, we further derive the declara- being more robust w.r.t. vagueness in the input data. tion of tagged unions, denoted 𝑇 , a sequence of (possibly Kruse et al. developed in [40] the scalable discovery of nested) JSON Schema if-then-else statements. inclusion dependencies from data. Scalability of the ap- By the conjunction of the subschemas as proach is achieved by two reasons: the approach is only {"allOf": [S,𝑇 ]}, the composite schema requires the concentrating on discovery of unary inclusion depen- JSON values to satisfy both subschemas, imposing tagged dencies (consisting of one attribute on the left-hand and unions on schema 𝑆. This approach allows us to focus on the right-hand side) and is developing a distributed on the particular challenge of identifying tagged unions, approach. Combining both, a very efficient algorithm while conveniently leveraging state-of-the-art schema can be developed for analyzing large datasets. The het- extraction tools for describing the other aspects of the erogeneity of datasets has not been considered in this 32 Schema 𝑆 JSON JSON 8 Schema Schema Parse Comp. Position {allOf:[𝑆 ,𝑇 ]} Records of Rel. List CFDs if-then-else 7 Tree 1 Encoding Indices Candidate List Constraints 𝑇 CFDs List XYZ 𝜋X ={. . . } [X=v1 ]→[Y=t1 ] [X=v1 ]→[Y=t1 ] JSON · · · 𝜋Y ={. . . } 5 2 A B C 3 · · · 4 [X=v2 ]→[Y=t2 ] 6 [X=v2 ]→[Y=t2 ] · · · 𝜋Z ={. . . } [X=v3 ]→[Z=t3 ] [X=v3 ]→[Z=t3 ] Schema Extrac- ... ... ... ... ... ... ... ... ... tion, e.g. [1] Dependency Discovery Figure 4: System architecture overview. (1) A third-party tool is used to extract a JSON Schema description 𝑆 of JSON input data. Steps (2) through (7) visualize the discovery of tagged unions as schema 𝑇 , and are described in Section 4. In step (8), the schemas 𝑆 and 𝑇 are composed into a composite schema. Boxed areas capture state-of-the-art algorithms integrated in our architecture. A, B and C are JSON property labels while X, Y and Z are attributes in our relation encoding (e.g., A.value). schema. We leave it to future work to recursively merge implied subschema. Recall that with the dot-notation, we the composite schema, thereby improving the readability distinguish the value of JSON property 𝐴 (i.e., 𝐴.value) as well as the locality of the interacting constraints. from the subschema of sibling property 𝐵 (i.e., 𝐵.type) in our relational encoding of JSON objects. Relational encoding. In a first step towards the dis- Example 7. From the relational encoding in Table 1, we covery of tagged unions, we parse all JSON values into a derive the following dependencies, again using the abbre- single parse tree (adding a virtual root node), and deter- viations t1 and t2 for the subschemas introduced earlier: mine the set of all labeled paths from the root to a JSON [type.value = "Point"] → [coordinates.type = t1] object. For each such path, we then derive the relational [type.value = "LineString"] → [coordinates.type = t2] encoding of all properties in the reachable objects. We have already introduced this data structure in Section 2, Recall that in the context of our application, they are and resume the discussion of our running example. interpreted as follows. Given a JSON object reachable by the path /geometries[*], if the value of property Example 6. Consider the GeoJSON value in Figure 1. type is “Point”, the value of the sibling property coordi- Table 1 shows the relational encoding of the four objects nates must conform to the subschema t1 (an array of reachable by the path /geometries[*]. The role of numbers). If the value of property type is “LineString”, the unique object identifier is to preserve cardinalities then the value of sibling property coordinates must in repeating attribute values, allowing us to later filter conform to the subschema t2 (an array of points). dependencies based on their support in the input. In the discovery of these dependencies, we leverage The running example has been designed to be simple, state-of-the-art algorithms [53, 47, 54, 55] (originally de- and it ignores challenges such as properties with varying veloped for the discovery of traditional functional de- types, dealing with several properties whose values have pendencies), such as a compressed-records encoding of basic types, or null-values in the relational encoding. Our the relation, computing position list indexes (PLIs) and approach can be extended to robustly handle these cases producing a sequence of candidate dependencies. PLIs as well, and we refer to Section 4.3 for a discussion on are calculated for each column of the relational encoding the most interesting future generalizations. and allow to effectively group all rows (identified by their row number) with the same value in the same subset. Dependency discovery. Based on the relational en- In the following, we outline the general idea. Inferring coding, we discover conditional functional dependencies. conditional functional dependencies from PLIs can be We restrict ourselves to unary constant conditional de- achieved by finding inclusions between them: given PLIs pendencies in our relational encoding, of the form 𝜋A.value = {{1, 2, 3}, {4, 5, 6}} of attribute 𝐴.value and 𝜋B.type = {{1, 2, 3, 4}, {5}, {6}} of attribute 𝐵.type, we [𝐴.value = 𝑐] → [𝐵.type = 𝜎] receive the inclusion {1, 2, 3} ⊂ {1, 2, 3, 4}, allowing us to infer that the values of JSON property A corresponding where the left-hand-side denotes the value of the can- to rows 1, 2 and 3 determine the subschemas of the sibling didate tag, and the right-hand-side a property with an property B. implied subschema. Above, 𝐴 and 𝐵 are distinct prop- More sophisticated algorithms developed for the dis- erty labels, 𝑐 is a basic-value constant, and 𝜎 denotes the covery of conditional functional dependencies (CFDs), 33 such as CFDMiner, CTANE and FastCFD [50], could be Minimum Threshold 𝜋min . Especially in large data- deployed just as well. As our current restrictions, most sets, it is to be expected that dependencies are inferred notably the one to unary dependencies, obfuscate many that have no real semantics, despite us ignoring attributes of the challenges in FD and CFD discovery, we opted forwith unique values. Our implementation has a config- our simpler approach. urable minimum threshold: Dependencies with insuffi- cient support in the input are then ignored during the Pruning candidate dependencies. To avoid over- generation of if-then-else statements. fitting in the derived schemas, we heuristically prune By allowing to configure this threshold, either as an candidate dependencies, as described in the upcoming absolute or relative value, the number of discovered de- Section 4.2. pendencies can be influenced. Our experiments evaluate this effect. Finalization. In a final step, the conditional functional dependencies are transformed into nested if-then-else 4.3. Future Generalizations of the statements in JSON Schema 𝑇 . This step is merely tech- Approach nical, as is the construction of the composite schema. In GeoJSON data, which served as our motivation for the discovery of tagged unions, the geometry objects 4.2. Heuristics are highly regular. When inspecting further instances In our approach, an inherent challenge is dealing with of tagged unions in real-world data, we encounter vari- high numbers of discovered dependencies, which may ations that require us to generalize our approach. For lead to overfitting the derived schema w.r.t. the input data. illustration, we again resort to an example. We next distinguish default heuristics and heuristics that are configurable by an expert user. 1 [ { " condition ": " minecraft : time_check " , 2 " period ": 24000 , 3 " value ": { " min ": 0.0 } } ] 4.2.1. Default Heuristics (a) The value of property min (line 3) is numeric. The following heuristics are always applied, as they pre- vent schema bloat due to overfitting. 1 [ { " condition ": " minecraft : weather_check " , 2 " raining ": false , 3 " thundering ": false } , Single-valued attributes. In discovering dependen- 4 { " condition ": " minecraft : time_check " , cies, we ignore all attributes from the relational encoding 5 " period ": 24000 , 6 " value ": { with a single-valued domain. In Table 1, this concerns 7 " min ": { 8 " target ": { the attribute type.type. 9 " name ": "% StartSunburn " , 10 " type ": " minecraft : fixed " } , 11 " score ": " Daytime " , Unique attributes. Unique property values are rec- 12 " type ": " minecraft : score " } } } ] ognized as conditional functional dependencies, and ul- timately, cause overfitting and schema bloat. In depen- (b) The value of property min (line 7) is of type object. dency discovery, we therefore ignore all attributes where the domain consists of unique values (such as the object Figure 5: Minecraft JSON data (snippets edited and short- ened, clickable links to GitHub source in the subcaptions). identifier 𝑂.id, which does not appear in the input data, Property condition functions as the tag in the tagged union. and need not be considered in dependency discovery). Union rule. We apply the union rule to the discovered Example 8. We consider JSON data used in the Minecraft dependencies, so that one tag may imply the subschemas game, as shown in Figure 5. The property condition for several sibling properties. This improves schema functions as a tag, and distinguishes notions such as “time succinctness. checks” and “weather checks”, among others. Depending on the tag value, different sibling properties exist: Given 4.2.2. Configurable Heuristics that the tag value is a “weather check”, sibling properties labeled raining and thundering exist. Given that the The following heuristics is reasonable in many cases. As tag value is a “time check”, sibling properties labeled it is not universally applicable, we make it configurable. period and value exist. To extend our approach to such more general forms of tagged unions, we need to be able to detect condi- tional dependencies between the value of a candidate tag, 34 and the existence of a sibling property (which we term as a fully automated reproduction package, are avail- value-label constraints, to distinguish from the value- able online [12]. Additionally, in the PDF version of this type constraints that we already handle). Here, a specific article, clickable links in Table 2 allow our readers to di- challenge is to handle optional sibling properties, which rectly inspect the input datasets, as well as the extracted are not always present. schemas. Example 9. Resuming the previous example of a Minecraft “time check”, the subschema for property value is an Datasets. We consider the five JSON datasets listed in object with a property min. The subschema of property Table 2, all in the GeoJSON or the related TopoJSON for- min, however, may vary. In the top example, it is a nu- mat. For simplicity, we will commonly treat the formats meric value. In the bottom example, it is a nested object. GeoJSON and TopoJSON synonymously, in discussing Our approach outlined so far cannot detect this condi- our results. tional functional dependency, because the subschemas We identified these datasets by manually searching of the sibling property differ. open data servers, as well as by performing a search over all open-source licensed repositories on GitHub using To detect value-type dependencies where the types on Google BigQuery. Our selection criteria were (1) a suf- the right-hand-side may vary, we need to relax the rec- ficiently large document size and (2) that at least two ognized subschema. In our current prototype implemen- different types of geometries co-occur, so that we may tation, we introduce attributes min.type1 , . . . , min.type𝑘 be able to infer tagged unions. (e.g., for a fixed limit 𝑘 = 6), where we describe the We briefly describe the datasets: subschema with a varying degree of detail. For instance, • Map Germany is a TopoJSON dataset, describ- from a shallow min.type1 = {"type": "object} to ing a map of Germany and surrounding areas, a very fine-grained description that fully captures all including cities, urban areas, and rivers. This nested objects. In pruning candidate dependencies, we dataset uses geometry types Point, LineString, only keep the most detailed subschema for which a CFD MultiLineString, Polygon and MultiPolygon. still holds. This allows us to detect a common subschema • Map EU is a TopoJSON dataset. Its structure is in many practical cases. comparable to Map Germany but describes coun- tries in the EU in detail and contains the borders 5. Experiments of some adjacent countries. • WorldGen Berlin is a GeoJSON dataset from Open- We next describe our experiments using real-world geo- StreetMap. spatial datasets encoded in JSON. For notational simplic- • US House Members lists members of the US House ity, we will write “CFDs” when we refer to the subfamily of representatives, including their place of birth, of unary constant conditional functional dependencies, hometown and congressional district, encoded in as introduced in Section 2. GeoJSON. • Wildlife Sites contains wildlife sites in West York- 5.1. Setup shire, described either as a GeoJSON Polygon or MultiPolygon. Implementation. Our prototype is implemented in The datasets encode GeoJSON or the related TopoJSON Python, using the library anytree for managing parse format (distinguished as formats “T” and “G” in the table). trees. For schema validation, we use the Python library We list the size of each dataset in lines of code after pretty- jsonschema (clickable links in the PDF). printing (column |𝐷|). For the “third-party” schema extraction tool, we em- In lieu of a ground truth describing the number of con- ploy a tool that we have built in earlier work [1]. Note ditional functional dependencies in the data that is not that our approach is designed to work with any other tool GeoJSON-encoded, we report in the table the number for JSON Schema extraction. Also, since we construct of detectable CFDs in the GeoJSON-parts only (𝐷𝑒𝑡𝐺𝑒𝑜 ). the composite schema by combining two subschemas, We manually determine this value for each dataset, in- we are not restricted to Python-based tools. specting all distinct types of GeoJSON objects (e.g., Point, To confirm the composability of our approach, we have LineString, Polygon) on each different path. We ignore successfully built a variant of our architecture based on all paths where the value of the GeoJSON type does not the schema extraction tool by Frozza et al. [31, 56] (with vary. While these can be detected as CFDs from the data, minor technical adaptions), but we do not report the they are not meaningful for recognizing tagged unions, results, since they do not contribute new insights. and should be modeled, for instance, simply as constant values instead. Artifact availability and reproducibility. Our re- search artifacts (code, data, extracted schemas), as well 35 Table 2 Summary of experimental results. Links embedded in the PDF refer to the original dataset (𝐷). We state the dataset format (TopoJSON or GeoJSON) and report its size (|𝐷|) in lines of code (LoC), the size of the conventionally extracted schema 𝑆 (in LoC) and the number of theoretically detectable Geo-/TopoJSON CFDs (𝐷𝑒𝑡𝐺𝑒𝑜 ). For different threshold settings 𝜋min , we produce a subschema 𝑇 encoding the tagged unions. “Ratio𝑇 ” reports the share of 𝑇 w.r.t. the size of the composite schema. Further, the number of CFDs initially discovered (w/o heuristics), after applying threshold 𝜋min (see Section 4.2.2), and after applying the remaining default heuristics (see Section 4.2.1) are reported. The last column reports the result of checking whether dataset 𝐷 is valid w.r.t. the extracted composite schema (a checkmark symbol “✓” denotes a successful check). The checkmarks are clickable links, embedded in the PDF, to the composite schema. Dataset D For- |D| |S| DetGeo 𝜋min |T| Ratio CFDs w/o CFDs CFDs w/ Va- mat (LoC) (LoC) (LoC) 𝑇 Heuristics w/ 𝜋min and lid 𝜋min Heuristics 50% 126 23.6% 11 4 ✓ Map Germany T 181 585 405 6 35% 158 27.9% 44 14 5 ✓ 15% 190 31.7% 20 6 ✓ 50% 95 20.7% 6 3 ✓ Map EU T 122 380 359 6 35% 95 20.7% 10 6 3 ✓ 15% 127 25.9% 8 4 ✓ 50% 150 10.0% 76 4 ✓ WorldGen B. G 56 441 1 339 11 35% 182 11.9% 1 243 109 5 ✓ 15% 278 17.1% 184 8 ✓ 50% 54 9.8% 459 2 ✓ US House M. G 22 745 493 2 35% 54 9.8% 719 470 2 ✓ 15% 54 9.8% 660 2 ✓ 50% 38 20.5% 18 1 ✓ Wildlife Sites G 876 482 143 2 35% 38 20.5% 360 18 1 ✓ 15% 92 38.5% 50 3 ✓ Execution environment. Our experiments were con- We use DetGeo as a target in discussing recall of depen- ducted on an off-the-shelf notebook with an Intel i7- dency discovery in the GeoJSON part of the input. 1165G7 CPU with 4.7GHz and 16GB of main memory. 5.3. Results 5.2. Experimental Design Table 2 summarizes our results. Note that all composite Workflow. For each data collection, we extract com- schemas successfully validate against Draft-07, and fur- posite schemas choosing three settings for the minimum ther, that all JSON datasets validate against the composite threshold 𝜋min : an aggressive setting of 50%, requiring a schemas. conditional functional dependency to occur in at least Decreasing the threshold 𝜋min leads to more condi- half of the tuples in the relational encoding of all objectstional functional dependencies being discovered. The reachable by a given path, and more lenient settings with share of the subschema encoding tagged unions com- 35% and 15%, respectively. pared to the entire composite schema ranges between For each threshold setting, we generate schemas 𝑆 10% and up to approx. 38%. While the 38% share might (from the third-party tool) and 𝑇 (encoding the tagged appear to be large, the composite schema has less than unions), as well as the composite schema. 250 lines in total, compared to 800K lines of JSON input. Each composite schema is validated against the spec- Thus, this dataset is highly regular in its structure, and ification of JSON Schema Draft-07, confirming that the the extracted schema comparatively compact. composite schema conforms. The input datasets are fur- For the bottom three datasets, we detect conditional ther validated against the extracted composite schema, functional dependencies in the hundreds, even more than checking for any logical errors in schema extraction. one thousand in the case of the WorldGen Berlin data. However, applying the threshold and the other heuris- Metrics. In our quantitative assessment, we report tics drastically reduces the dependencies towards the schema sizes (after syntactic normalization by pretty- number of tagged unions that we expect to find (c.f. the printing), the number of conditional functional depen- target DetGeo ). Note that there are only six distinct ge- dencies initially discovered, and after filtering according ometries in GeoJSON which may, however, occur on to our heuristics. different paths, leading to more than six if-then-else 36 statements in the GeoJSON part. listed in Example 2), as well as synthetic data, and data After applying heuristics and a threshold of 15%, we not encoding any tagged unions, are required. only miss two detectable CFDs in Map EU and three in We do not focus on the performance of our unopti- WorldGen Berlin. Missed CFDs can be attributed to the mized prototype implementation, as the algorithm is threshold being too coarse. only executed once for each dataset. Of course, scal- Manually inspecting the final dependencies, we find ability to larger inputs is an issue to be addressed in one unexpected dependency for Wildlife Sites with a future work, as CFD discovery in general is an expensive threshold of 15%. This is the only false positive depen- problem: Already in addressing the discovery of func- dency across all datasets. It results in a tagged union tional dependencies in their full generality, algorithms being declared for a specific date, which is not seman- have exponential runtime complexity [57]. Several algo- tically meaningful. Further, this dependency is outside rithms for the specific problem of CFD discovery have the GeoJSON-encoded part of the data, explaining why been proposed [50]. Their evaluations show that runtime the number of detected CFDs is higher than 𝐷𝑒𝑡𝐺𝑒𝑜 . performance depends heavily on the input, with some Thus, we have a single case of overfitting for the datasets algorithms scaling better with size of the dataset (i.e., the analyzed. Overall, we observe very high precision in rec- number of tuples) thus being suitable for large datasets, ognizing the GeoJSON/TopoJSON dependencies resident while others perform better with higher arity. For con- in the data. stant CFDs, Li et al. [52] achieved promising runtime We do not focus on runtime evaluation, as our im- improvements by applying custom rules for pruning the plementation is prototypical and unoptimized. Yet to search space. provide a general perspective, we share our observation With strong restrictions to the problem space, such as that runtimes vary greatly between datasets, ranging our restriction to unary constant CFDs in our case, we can from roughly one second to approximately one minute. expect reasonable runtime performance on real-world This is a waiting time which we deem acceptable for a inputs (which are often reasonably well-behaved). non-interactive, irregular task. In our experiments, we observed a considerable impact Since our approach is main-memory based, memory of our heuristics and threshold on runtime, reducing the is a physical limitation to the inputs that we can pro- processing time for all datasets by up to two orders of cess. For the datasets considered here, 16GB of RAM are magnitude. This is promising for further optimizations. sufficient to run our experiments. 6. Conclusion and Outlook 5.4. Discussion The comparison between the number of dependencies In this article, we proposed a method to infer conditional found with different heuristics shows the effectiveness functional dependencies from JSON data. We use the of these heuristics in pruning dependencies. In particu- identified dependencies as the basis for declaring tagged lar, this concerns the configuration knob represented by unions in JSON schema extraction. minimum threshold. This allows us to capture value-based constraints. In Our approach to setting this threshold is rather coarse, fact, in [35] Gallinucci et al. report on expert interviews with 50% obviously too rigid, but 15% delivering mean- with users, regarding the users’ preferences in schemas ingful results. In future work, we plan to investigate over nested data. Their interviews reveal that value- how to auto-adjust the threshold, based on statistical based conditions have a greater influence on the differ- distributions obtained by profiling the input. entiation of schema variants than structural constraints, An obvious threat to the generalizability of our re- and are therefore preferred. sults — however promising they are — is that the Geo- In future work, our approach can be extended for the JSON datasets are highly regular in their structure. In extraction of further variants of dependencies, for in- the GeoJSON-part, they contain only a small number of stance, traditional conditional functional dependencies detectable dependencies, between two and eleven. There- that capture implications between atomic property val- fore, we refrain from computing the metrics precision ues, or dependencies where the value of the tag property and recall, as they are easily distorted when working implies the existence of a specific sibling property, as with small numbers. However, the manual inspection of discussed in greater detail in Section 4.3. This allows to the derived dependencies, and the comparison against recognize a larger family of tagged unions. the target DetGeo , shows that our approach is indeed suc- Our prototype implementation is currently main-mem- cessful for the input datasets chosen. ory based, which limits the size of inputs that we can To counter the threat of generalizability, further experi- handle. Making our implementation scale to larger inputs ments over different real-world datasets (e.g., the datasets is one of the immediate next steps. Here, we may build upon first results by Mior [18], who shows that discover- 37 ing dependencies in a relational encoding of JSON data [4] P. Contos, M. Svoboda, JSON schema inference has inferior runtime performance when compared to dis- approaches, in: Proc. ER (Workshops), 2020, pp. covering dependencies in a streaming fashion. Also, we 173–183. plan to consider a MapReduce-based approach to schema [5] I. Veinhardt Latták., P. Koupil., A comparative anal- extraction, as implemented by Baazizi et al. in [2]. ysis of JSON schema inference algorithms, in: Proc. Ultimately, our goal is to obtain a schema declaration ENASE, 2022, pp. 379–386. that human consumers consider to be comprehensive, but [6] M. Klettke, H. Awolin, U. Störl, D. Müller, that may also be efficiently processed programmatically. S. Scherzinger, Uncovering the evolution history of In order to obtain more succinct schemas, we need to data lakes, in: Proc. Big Data, 2017, pp. 2462–2471. resolve redundancies between the schema extracted by a [7] M. A. Baazizi, C. Berti, D. Colazzo, G. Ghelli, C. Sar- third-party tool and our encoding of tagged unions. This tiani, Human-in-the-loop schema inference for requires rewriting the composite schema based on an massive JSON datasets, in: Proc. EDBT, 2020, pp. algebraic representation of JSON Schema operators, such 635–638. as the schema algebra proposed by Attouche et al. [13]. [8] J. Namba, Enhancing JSON schema discovery by Further, identifying metrics that capture the quality of uncovering hidden data, in: Proc. VLDB 2021 PhD the extracted schemas will allow to quantitatively com- Workshop, 2021. pare schemas extracted by different approaches. Given [9] M. Droettboom, Understanding JSON schema, suitable metrics, the configuration of heuristics could https://json-schema.org/understanding-json- even be adjusted automatically. A possible direction is schema/reference/conditionals.html#if-then-else, to explore the notions of precision and recall, and the 2022. Draft 2020-12. proxy-metric of schema entropy, as introduced by Spoth [10] M. A. Baazizi, D. Colazzo, G. Ghelli, C. Sartiani, et al. in [3]. S. Scherzinger, An empirical study on the "usage A further task is to adopt a CFD inference algorithm of not" in real-world JSON schema documents, in: that is robust despite poor data quality and that can infer Proc. ER, 2021, pp. 102–112. constraints despite outliers in the data. Naturally, this [11] A. A. Frozza, R. dos Santos Mello, Js4geo: A canon- requires a relaxation to “soft” CFDs, a task where we may ical JSON schema for geographic data suitable to also build upon existing work on relational [21] and even NoSQL databases, GeoInformatica 24 (2020) 987– JSON data [18]. 1019. In summary, our long-term vision is to extract com- [12] S. Klessinger, M. Klettke, U. Störl, S. Scherzinger, prehensible and therefore human-consumable schema Extracting JSON Schemas with Tagged declarations from JSON data. We believe that the detec- Unions (Reproduction Package), 2022. URL: tion of schema design patterns that are popular among https://doi.org/10.5281/zenodo.6985647. schema designers, such as tagged unions, is an important doi:10.5281/zenodo.6985647. building block towards realizing this vision. [13] L. Attouche, M. A. Baazizi, D. Colazzo, G. Ghelli, C. Sartiani, S. Scherzinger, Witness genera- Acknowledgments: This work was funded by Deut- tion for JSON schema, CoRR abs/2202.12849 sche Forschungsgemeinschaft (DFG, German Research (2022). URL: https://arxiv.org/abs/2202.12849. Foundation) grant #385808805. We thank Thomas Kirz arXiv:2202.12849. for expertly typesetting the systems architecture in LATEX. [14] F. Pezoa, J. L. Reutter, F. Suárez, M. Ugarte, D. Vrgoc, We thank the anonymous reviewers of DEco’22, espe- Foundations of JSON Schema, in: Proc. WWW, cially reviewer #2, for the detailed and helpful feedback. 2016, pp. 263–273. [15] F. Suárez, J. Reutter, D. Vrgoc, M. Ugarte, F. Pezoa, JSON Schema: Multiple Types (Software), https:// References cswr.github.io/JsonSchema/spec/multiple_types/, 2016. [1] M. Klettke, U. Störl, S. Scherzinger, Schema extrac- [16] P. Bohannon, W. Fan, F. Geerts, X. Jia, A. Kementsi- tion and structural outlier detection for JSON-based etsidis, Conditional functional dependencies for NoSQL data stores, in: Proc. BTW, volume P-241, data cleaning, in: Proc. ICDE, 2007, pp. 746–755. GI, 2015, pp. 425–444. [17] M. Arenas, L. Libkin, A normal form for xml [2] M. A. Baazizi, D. Colazzo, G. Ghelli, C. Sartiani, documents, ACM Trans. Database Syst. 29 (2004) Parametric schema inference for massive JSON 195–232. datasets, VLDB J. 28 (2019) 497–521. [18] M. J. Mior, Fast discovery of nested dependencies on [3] W. Spoth, O. Kennedy, Y. Lu, B. C. Hammerschmidt, JSON data, CoRR abs/2111.10398 (2021). URL: https: Z. H. Liu, Reducing Ambiguity in JSON Schema //arxiv.org/abs/2111.10398. arXiv:2111.10398. Discovery, in: Proc. SIGMOD, 2021, pp. 1732–1744. [19] S. Kruse, A. Jentzsch, T. Papenbrock, Z. Kaoudi, 38 J. Quiané-Ruiz, F. Naumann, Rdfind: Scalable con- [35] E. Gallinucci, M. Golfarelli, S. Rizzi, Schema profil- ditional inclusion dependency discovery in RDF ing of document-oriented databases, Inf. Syst. 75 datasets, in: Proc. SIGMOD, 2016, pp. 953–967. (2018) 13–25. [20] J. Friesen, Java XML and JSON: Document Process- [36] S. Song, F. Gao, R. Huang, C. Wang, Data depen- ing for Java SE, Apress, 2019, pp. 299–322. dencies extended for variety and veracity: A family [21] J. Rammelaere, F. Geerts, Revisiting conditional tree, IEEE Transactions on Knowledge and Data functional dependency discovery: Splitting the "C" Engineering 34 (2022) 4717–4736. from the "FD", in: Proc. ECML PKDD, 2018, pp. [37] M. L. Lee, T. W. Ling, W. L. Low, Designing func- 552–568. tional dependencies for XML, in: Proc. EDBT, 2002, [22] W3C, W3C XML Schema Definition Language pp. 124–141. (XSD) 1.1 Part 1: Structures, W3C Recom- [38] S. Hartmann, S. Link, More functional dependencies mendation, 2012. URL: https://www.w3.org/TR/ for xml, in: L. Kalinichenko, R. Manthey, B. Thal- xmlschema11-1/. heim, U. Wloka (Eds.), Advances in Databases and [23] ISO/IEC, ISO/IEC 19757-3:2020 Information tech- Information Systems, Springer Berlin Heidelberg, nology — Document Schema Definition Languages Berlin, Heidelberg, 2003, pp. 355–369. (DSDL) — Part 3: Rule-based validation — Schema- [39] L. Kot, W. M. White, Characterization of the inter- tron, ISO/IEC Standard, 2020. URL: https://www. action of XML functional dependencies with DTDs, iso.org/standard/74515.html. in: Proc. ICDT, 2007, pp. 119–133. [24] ISO/IEC, ISO/IEC 19757-2:2008 Information tech- [40] S. Kruse, T. Papenbrock, F. Naumann, Scaling out nology — Document Schema Definition Language the discovery of inclusion dependencies, in: Proc. (DSDL) — Part 2: Regular-grammar-based valida- BTW, 2015, pp. 445–454. tion — RELAX NG, ISO/IEC Standard, 2008. URL: [41] T. Vajk, P. Fehér, K. Fekete, H. Charaf, Denormal- https://www.iso.org/standard/52348.html. izing data into schema-free databases, in: Proc. [25] M. Garofalakis, A. Gionis, R. Rastogi, S. Seshadri, CogInfoCom, 2013, pp. 747–752. K. Shim, XTRACT: A system for extracting docu- [42] X. Chu, I. F. Ilyas, S. Krishnan, J. Wang, Data clean- ment type descriptors from XML documents, in: ing: Overview and emerging challenges, in: Proc. Proc. SIGMOD, 2000, pp. 165–176. SIGMOD, 2016, pp. 2201–2206. [26] C.-H. Moh, E.-P. Lim, W.-K. Ng, DTD-Miner: a tool [43] S. Schelter, D. Lange, P. Schmidt, M. Celikel, F. Biess- for mining DTD from XML documents, in: Proc. mann, A. Grafberger, Automating large-scale data WECWIS, 2000, pp. 144–151. quality verification, Proc. VLDB Endow. 11 (2018) [27] J. Hegewald, F. Naumann, M. Weis, Xstruct: Ef- 1781–1794. ficient schema extraction from multiple and large [44] J. Kivinen, H. Mannila, Approximate inference of XML documents, in: Proc. Workshops ICDE, 2006. functional dependencies from relations, Theoretical [28] M. Klempa, M. Kozák, M. Mikula, R. Smetana, Computer Science 149 (1995) 129–149. J. Stárka, M. Svirec, M. Vitásek, M. Necaský, I. Hol- [45] H. Mannila, K.-J. Räihä, On the complexity of in- ubová, jInfer: A framework for XML schema infer- ferring functional dependencies, Discrete Applied ence, Comput. J. 58 (2015) 134–156. Mathematics 40 (1992) 237–243. [29] B. Chidlovskii, Schema extraction from xml collec- [46] H. Yao, H. J. Hamilton, Mining functional depen- tions, in: Proc. JCDL, 2002, p. 291–292. dencies from data, Data Mining and Knowledge [30] I. Mlýnková, M. Nečaský, Heuristic methods for in- Discovery 16 (2008) 197–219. ference of xml schemas: Lessons learned and open [47] T. Papenbrock, F. Naumann, A hybrid approach issues, Informatica 24 (2013) 577–602. to functional dependency discovery, in: Proc. SIG- [31] A. A. Frozza, R. dos Santos Mello, F. de Souza da MOD, 2016, pp. 821–833. Costa, An approach for schema extraction of JSON [48] J. Bauckmann, U. Leser, F. Naumann, V. Tietz, Effi- and extended JSON document collections, in: Proc. ciently detecting inclusion dependencies, in: Proc. IRI, 2018, pp. 356–363. ICDE, 2007, pp. 1448–1450. [32] D. S. Ruiz, S. F. Morales, J. G. Molina, Inferring [49] R. Hai, C. Quix, D. Wang, Relaxed functional de- versioned schemas from NoSQL databases and its pendency discovery in heterogeneous data lakes, applications, in: Proc. ER, 2015, pp. 467–480. in: Proc. ER, 2019, pp. 225–239. [33] J. L. C. Izquierdo, J. Cabot, Discovering implicit [50] W. Fan, F. Geerts, L. V. S. Lakshmanan, M. Xiong, schemas in JSON data, in: Proc. ICWE, 2013, pp. Discovering conditional functional dependencies, 68–83. in: Proc. ICDE, 2009, pp. 1231–1234. [34] D. Durner, V. Leis, T. Neumann, JSON tiles: Fast an- [51] W. Fan, F. Geerts, J. Li, M. Xiong, Discovering alytics on semi-structured data, in: Proc. SIGMOD, conditional functional dependencies, IEEE Trans. 2021, pp. 445–458. Knowl. Data Eng. 23 (2011) 683–698. 39 [52] J. Li, J. Liu, H. Toivonen, J. Yong, Effective pruning for the discovery of conditional functional depen- dencies, Comput. J. 56 (2013) 378–392. [53] I. F. Ilyas, X. Chu, Data Cleaning, ACM, 2019. [54] P. Schirmer, T. Papenbrock, S. Kruse, F. Nau- mann, D. Hempfing, T. Mayer, D. Neuschäfer-Rube, Dynfd: Functional dependency discovery in dy- namic datasets, in: Proc. EDBT, 2019, pp. 253–264. [55] Z. Wei, S. Link, Discovery and ranking of functional dependencies, in: Proc. ICDE, 2019, pp. 1526–1537. [56] A. A. Frozza, R. dos Santos Mello, F. de Souza da Costa, An approach for schema extraction of JSON and extended JSON document collec- tions (software), https://github.com/gbd-ufsc/ JSONSchemaDiscovery, 2018. [57] J. Liu, J. Li, C. Liu, Y. Chen, Discover dependencies from data - A review, IEEE Trans. Knowl. Data Eng. 24 (2012) 251–264. 40