New Workflows in NoSQL Schema Management∗ Michael Fruth Kai Dauberschmidt Stefanie Scherzinger University of Passau University of Passau University of Passau Passau, Germany Passau, Germany Passau, Germany michael.fruth@uni-passau.de stefanie.scherzinger@uni-passau.de ABSTRACT Schema extraction & validation. The DevOps team first has to Many NoSQL document stores allow for flexibility w.r.t. schema extract a schema declaration from the persisted data [2, 9, 13–15]. management: For instance, MongoDB allows to switch between Often, schema extraction algorithms rely on sampling to cope with a schema-free and a schema-fixed mode of operation. For declar- large data volumes. Consequently, the extracted schema may not ing such schemas, the JSON Schema language has become highly faithfully describe the entire data instance. In order to avoid valida- popular. We introduce the prototype software Josch, first demoed tion errors at runtime, the entire data instance needs to be validated at ICDE 2021, which enhances the NoSQL schema management against the extracted schema. This impacts database performance. workflow by integrating novel tools for checking JSON Schema containment. We point out new research challenges in this context. Schema refactoring & containment checking. When the schema is edited, e.g. adjusting it to account for outlier documents, or Reference Format: restructuring it for better readability, the team risks that the schema Michael Fruth, Kai Dauberschmidt, and Stefanie Scherzinger. New semantics is unintentionally changed. In JSON Schema containment Workflows in NoSQL Schema Management. In the 2nd Workshop on checking, two JSON Schema declarations are compared based on Search, Exploration, and Analysis in Heterogeneous Datastores (SEA Data their semantics. Thus, we can automatically decide whether the 2021). schema semantics has been changed. For illustration, let us consider two excerpts of JSON Schema documents that describe the month of a publication, 𝑆1: {"type": Artifact Availability: ["number","string"]} and 𝑆2: {"type": ["number"]}. Schema The source code has been made available online at [7]. 𝑆2 is contained in 𝑆1, and therefore more restrictive, as it requires the month to be numeric, whereas 𝑆1 also allows a string. 1 OVERVIEW NoSQL document stores such as MongoDB allow to switch between a schema-free and a schema-fixed mode of operation, by registering 3 RESEARCH CHALLENGES a JSON Schema [4, 11] declaration. Apart from solutions for isolated We refer to our extended version [6] of this paper for a more de- tasks, such as extracting a schema declaration from persisted docu- tailed discussion of related work. The full workflow just outlined ments, or validating documents against this schema, there are tools is supported by our software prototype Josch [6, 7], where Josch that combine these steps into comprehensive end-to-end schema is geared to (but not limited to) MongoDB, and employs the third- management workflows (e.g. Hackolade [9] or Darwin [12, 16]). party tools jsonsubschema [8] and is-json-schema-subset [10] for Towards this family of software products, we contribute a new JSON Schema containment checking. prototype called Josch [6, 7], where we enhance schema manage- State-of-the-art JSON Schema containment checkers do not pro- ment workflows by integrating novel tools for checking JSON vide any explanation as to why two schemas differ. As a form of Schema containment. In interaction with Josch, we identify new re- explainability, we may resort to generating a witness document [1], search challenges for both practitioners and theoreticians working i.e., a JSON document that is valid w.r.t. one schema but not the on search, exploration, and analysis in heterogeneous datastores. other. At the moment, this is still a young research field. Another limitation of current JSON Schema containment check- 2 WORKFLOWS ers are negation and recursive references [5]. While negation is Our application scenario showcases a DevOps team who started ap- rarely used in real-world schemas, it can lead to complex schemas [3]. plication development and production operations with a MongoDB The extracted schemas tend to be simplistic, yet highly verbose. backend in schema-free mode. For data quality assurance, the team A semi-automated refactoring that automatically extracts and in- at one point decides to register a JSON Schema declaration with its troduces references for repeating structures to alleviate these short- MongoDB backend, so all writes are validated against this schema. comings could prove helpful. Yet both schema refactorization and the extraction of complex schemas are open research challenges. ∗ An extended version of this work has been presented as a demo at ICDE 2021 [6]. Copyright © 2021 for the individual papers by the papers’ authors. Copyright © 2021 for the volume as a collection by its editors. This volume and its papers are published 4 OUTLOOK under the Creative Commons License Attribution 4.0 International (CC BY 4.0). Solutions to the challenges outlined would also find application Published in the Proceedings of the 2nd Workshop on Search, Exploration, and Anal- ysis in Heterogeneous Datastores, co-located with VLDB 2021 (August 16-20, 2021, beyond NoSQL schema management, e.g., in the static validation Copenhagen, Denmark) on CEUR-WS.org. of machine learning pipelines, as in the IBM LALE project [8]. ACKNOWLEDGMENTS [5] Michael Fruth, Mohamed-Amine Baazizi, Dario Colazzo, Giorgio Ghelli, Carlo Sartiani, and Stefanie Scherzinger. 2020. Challenges in Checking JSON Schema We thank Mohamed-Amine Baazizi, Dario Colazzo, Giorgio Ghelli, Containment over Evolving Real-World Schemas. In Proc. EmpER. 220–230. and Carlo Sartiani for sharing their insights on JSON Schema, Uta [6] Michael Fruth, Kai Dauberschmidt, and Stefanie Scherzinger. 2021. Josch: Man- aging Schemas for NoSQL Document Stores. In Proc. ICDE. 2693–2696. Störl for her comments on our full version of this paper, and the [7] Michael Fruth, Kai Dauberschmidt, and Stefanie Scherzinger. 2021. sdbs-uni- authors of [8] for assistance in using their tool. We thank Pascal p/josch: Josch Version 1.0.0. https://doi.org/10.5281/zenodo.5155117 Desmarets for providing us with an academic Hackolade license, [8] Andrew Habib, Avraham Shinnar, Martin Hirzel, and Michael Pradel. 2021. Find- ing Data Compatibility Bugs with JSON Subschema Checking. In ISSTA. 620–632. as well as his feedback from the practitioners’ point-of-view. [9] Hackolade. online. Hackolade. https://hackolade.com This project was supported by the Deutsche Forschungsgemein- [10] haggholm. online. is-json-schema-subset. https://github.com/haggholm/is-json- schema-subset version 1.1.24. schaft (DFG, German Research Foundation), grant #385808805. [11] JSON Schema. online. JSON Schema. https://json-schema.org [12] Meike Klettke, Hannes Awolin, Uta Störl, Daniel Müller, and Stefanie Scherzinger. REFERENCES 2017. Uncovering the Evolution History of Data Lakes. In Proc. Big Data. 2462– 2471. [1] Lyes Attouche, Mohamed-Amine Baazizi, Dario Colazzo, Francesco Falleni, Gior- [13] Meike Klettke, Uta Störl, and Stefanie Scherzinger. 2015. Schema Extraction and gio Ghelli, Cristiano Landi, Carlo Sartiani, and Stefanie Scherzinger. 2021. A Tool Structural Outlier Detection for JSON-based NoSQL Data Stores. In Proc. BTW. for JSON Schema Witness Generation. In Proc. EDBT. 694–697. 425–444. [2] Mohamed-Amine Baazizi, Dario Colazzo, Giorgio Ghelli, and Carlo Sartiani. 2019. [14] Diego Sevilla Ruiz, Severino Feliciano Morales, and Jesús García Molina. 2015. Parametric schema inference for massive JSON datasets. VLDB J. 28, 4 (2019), Inferring Versioned Schemas from NoSQL Databases and its Applications. In 497–521. Proc. ER. 467–480. [3] Mohamed-Amine Baazizi, Dario Colazzo, Giorgio Ghelli, Carlo Sartiani, and [15] William Spoth, Oliver Kennedy, Ying Lu, Beda Christoph Hammerschmidt, and Stefanie Scherzinger. in-press. An Empirical Study on the "Usage of Not" in Zhen Hua Liu. 2021. Reducing Ambiguity in Json Schema Discovery. In Proc. Real-World JSON Schema Documents. In Proc. ER. SIGMOD. 1732–1744. [4] Pierre Bourhis, Juan L. Reutter, Fernando Suárez, and Domagoj Vrgoc. 2017. JSON: [16] Uta Störl, Daniel Müller, Alexander Tekleab, Stephane Tolale, Julian Stenzel, Data model, Query languages and Schema specification. In Proc. PODS. 123–135. Meike Klettke, and Stefanie Scherzinger. 2018. Curating Variational Data in Application Development. In Proc. ICDE. 1605–1608.