=Paper= {{Paper |id=Vol-2929/poster1 |storemode=property |title=New Workflows in NoSQL Schema Management |pdfUrl=https://ceur-ws.org/Vol-2929/poster1.pdf |volume=Vol-2929 |authors=Michael Fruth,Kai Dauberschmidt,Stefanie Scherzinger |dblpUrl=https://dblp.org/rec/conf/vldb/FruthDS21 }} ==New Workflows in NoSQL Schema Management== https://ceur-ws.org/Vol-2929/poster1.pdf
                     New Workflows in NoSQL Schema Management∗
                   Michael Fruth                                         Kai Dauberschmidt                              Stefanie Scherzinger
              University of Passau                                         University of Passau                          University of Passau
               Passau, Germany                                              Passau, Germany                                Passau, Germany
          michael.fruth@uni-passau.de                                                                            stefanie.scherzinger@uni-passau.de

ABSTRACT                                                                                     Schema extraction & validation. The DevOps team first has to
Many NoSQL document stores allow for flexibility w.r.t. schema                            extract a schema declaration from the persisted data [2, 9, 13–15].
management: For instance, MongoDB allows to switch between                                Often, schema extraction algorithms rely on sampling to cope with
a schema-free and a schema-fixed mode of operation. For declar-                           large data volumes. Consequently, the extracted schema may not
ing such schemas, the JSON Schema language has become highly                              faithfully describe the entire data instance. In order to avoid valida-
popular. We introduce the prototype software Josch, first demoed                          tion errors at runtime, the entire data instance needs to be validated
at ICDE 2021, which enhances the NoSQL schema management                                  against the extracted schema. This impacts database performance.
workflow by integrating novel tools for checking JSON Schema
containment. We point out new research challenges in this context.                           Schema refactoring & containment checking. When the schema
                                                                                          is edited, e.g. adjusting it to account for outlier documents, or
Reference Format:                                                                         restructuring it for better readability, the team risks that the schema
Michael Fruth, Kai Dauberschmidt, and Stefanie Scherzinger. New                           semantics is unintentionally changed. In JSON Schema containment
Workflows in NoSQL Schema Management. In the 2nd Workshop on                              checking, two JSON Schema declarations are compared based on
Search, Exploration, and Analysis in Heterogeneous Datastores (SEA Data                   their semantics. Thus, we can automatically decide whether the
2021).                                                                                    schema semantics has been changed.
                                                                                             For illustration, let us consider two excerpts of JSON Schema
                                                                                          documents that describe the month of a publication, 𝑆1: {"type":
Artifact Availability:
                                                                                          ["number","string"]} and 𝑆2: {"type": ["number"]}. Schema
The source code has been made available online at [7].
                                                                                          𝑆2 is contained in 𝑆1, and therefore more restrictive, as it requires
                                                                                          the month to be numeric, whereas 𝑆1 also allows a string.
1    OVERVIEW
NoSQL document stores such as MongoDB allow to switch between
a schema-free and a schema-fixed mode of operation, by registering                        3   RESEARCH CHALLENGES
a JSON Schema [4, 11] declaration. Apart from solutions for isolated                      We refer to our extended version [6] of this paper for a more de-
tasks, such as extracting a schema declaration from persisted docu-                       tailed discussion of related work. The full workflow just outlined
ments, or validating documents against this schema, there are tools                       is supported by our software prototype Josch [6, 7], where Josch
that combine these steps into comprehensive end-to-end schema                             is geared to (but not limited to) MongoDB, and employs the third-
management workflows (e.g. Hackolade [9] or Darwin [12, 16]).                             party tools jsonsubschema [8] and is-json-schema-subset [10] for
   Towards this family of software products, we contribute a new                          JSON Schema containment checking.
prototype called Josch [6, 7], where we enhance schema manage-                                State-of-the-art JSON Schema containment checkers do not pro-
ment workflows by integrating novel tools for checking JSON                               vide any explanation as to why two schemas differ. As a form of
Schema containment. In interaction with Josch, we identify new re-                        explainability, we may resort to generating a witness document [1],
search challenges for both practitioners and theoreticians working                        i.e., a JSON document that is valid w.r.t. one schema but not the
on search, exploration, and analysis in heterogeneous datastores.                         other. At the moment, this is still a young research field.
                                                                                              Another limitation of current JSON Schema containment check-
2    WORKFLOWS                                                                            ers are negation and recursive references [5]. While negation is
Our application scenario showcases a DevOps team who started ap-                          rarely used in real-world schemas, it can lead to complex schemas [3].
plication development and production operations with a MongoDB                                The extracted schemas tend to be simplistic, yet highly verbose.
backend in schema-free mode. For data quality assurance, the team                         A semi-automated refactoring that automatically extracts and in-
at one point decides to register a JSON Schema declaration with its                       troduces references for repeating structures to alleviate these short-
MongoDB backend, so all writes are validated against this schema.                         comings could prove helpful. Yet both schema refactorization and
                                                                                          the extraction of complex schemas are open research challenges.

∗ An extended version of this work has been presented as a demo at ICDE 2021 [6].

Copyright © 2021 for the individual papers by the papers’ authors. Copyright © 2021
for the volume as a collection by its editors. This volume and its papers are published   4   OUTLOOK
under the Creative Commons License Attribution 4.0 International (CC BY 4.0).             Solutions to the challenges outlined would also find application
Published in the Proceedings of the 2nd Workshop on Search, Exploration, and Anal-
ysis in Heterogeneous Datastores, co-located with VLDB 2021 (August 16-20, 2021,          beyond NoSQL schema management, e.g., in the static validation
Copenhagen, Denmark) on CEUR-WS.org.                                                      of machine learning pipelines, as in the IBM LALE project [8].
ACKNOWLEDGMENTS                                                                             [5] Michael Fruth, Mohamed-Amine Baazizi, Dario Colazzo, Giorgio Ghelli, Carlo
                                                                                                Sartiani, and Stefanie Scherzinger. 2020. Challenges in Checking JSON Schema
We thank Mohamed-Amine Baazizi, Dario Colazzo, Giorgio Ghelli,                                  Containment over Evolving Real-World Schemas. In Proc. EmpER. 220–230.
and Carlo Sartiani for sharing their insights on JSON Schema, Uta                           [6] Michael Fruth, Kai Dauberschmidt, and Stefanie Scherzinger. 2021. Josch: Man-
                                                                                                aging Schemas for NoSQL Document Stores. In Proc. ICDE. 2693–2696.
Störl for her comments on our full version of this paper, and the                           [7] Michael Fruth, Kai Dauberschmidt, and Stefanie Scherzinger. 2021. sdbs-uni-
authors of [8] for assistance in using their tool. We thank Pascal                              p/josch: Josch Version 1.0.0. https://doi.org/10.5281/zenodo.5155117
Desmarets for providing us with an academic Hackolade license,                              [8] Andrew Habib, Avraham Shinnar, Martin Hirzel, and Michael Pradel. 2021. Find-
                                                                                                ing Data Compatibility Bugs with JSON Subschema Checking. In ISSTA. 620–632.
as well as his feedback from the practitioners’ point-of-view.                              [9] Hackolade. online. Hackolade. https://hackolade.com
   This project was supported by the Deutsche Forschungsgemein-                            [10] haggholm. online. is-json-schema-subset. https://github.com/haggholm/is-json-
                                                                                                schema-subset version 1.1.24.
schaft (DFG, German Research Foundation), grant #385808805.                                [11] JSON Schema. online. JSON Schema. https://json-schema.org
                                                                                           [12] Meike Klettke, Hannes Awolin, Uta Störl, Daniel Müller, and Stefanie Scherzinger.
REFERENCES                                                                                      2017. Uncovering the Evolution History of Data Lakes. In Proc. Big Data. 2462–
                                                                                                2471.
 [1] Lyes Attouche, Mohamed-Amine Baazizi, Dario Colazzo, Francesco Falleni, Gior-         [13] Meike Klettke, Uta Störl, and Stefanie Scherzinger. 2015. Schema Extraction and
     gio Ghelli, Cristiano Landi, Carlo Sartiani, and Stefanie Scherzinger. 2021. A Tool        Structural Outlier Detection for JSON-based NoSQL Data Stores. In Proc. BTW.
     for JSON Schema Witness Generation. In Proc. EDBT. 694–697.                                425–444.
 [2] Mohamed-Amine Baazizi, Dario Colazzo, Giorgio Ghelli, and Carlo Sartiani. 2019.       [14] Diego Sevilla Ruiz, Severino Feliciano Morales, and Jesús García Molina. 2015.
     Parametric schema inference for massive JSON datasets. VLDB J. 28, 4 (2019),               Inferring Versioned Schemas from NoSQL Databases and its Applications. In
     497–521.                                                                                   Proc. ER. 467–480.
 [3] Mohamed-Amine Baazizi, Dario Colazzo, Giorgio Ghelli, Carlo Sartiani, and             [15] William Spoth, Oliver Kennedy, Ying Lu, Beda Christoph Hammerschmidt, and
     Stefanie Scherzinger. in-press. An Empirical Study on the "Usage of Not" in                Zhen Hua Liu. 2021. Reducing Ambiguity in Json Schema Discovery. In Proc.
     Real-World JSON Schema Documents. In Proc. ER.                                             SIGMOD. 1732–1744.
 [4] Pierre Bourhis, Juan L. Reutter, Fernando Suárez, and Domagoj Vrgoc. 2017. JSON:      [16] Uta Störl, Daniel Müller, Alexander Tekleab, Stephane Tolale, Julian Stenzel,
     Data model, Query languages and Schema specification. In Proc. PODS. 123–135.              Meike Klettke, and Stefanie Scherzinger. 2018. Curating Variational Data in
                                                                                                Application Development. In Proc. ICDE. 1605–1608.