<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>New Workflows in NoSQL Schema Management∗</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michael Fruth</string-name>
          <email>michael.fruth@uni-passau.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kai Dauberschmidt</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefanie Scherzinger</string-name>
          <email>stefanie.scherzinger@uni-passau.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Passau</institution>
          ,
          <addr-line>Passau</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <abstract>
        <p>Many NoSQL document stores offer flexibility with respect to schema management: for instance, MongoDB allows switching between a schema-free and a schema-fixed mode of operation. For declaring such schemas, the JSON Schema language has become highly popular. We introduce the prototype software Josch, first demoed at ICDE 2021, which enhances the NoSQL schema management workflow by integrating novel tools for checking JSON Schema containment. We point out new research challenges in this context.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>ARTIFACT AVAILABILITY</title>
      <p>
        The source code has been made available online at [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>OVERVIEW</title>
      <p>
        NoSQL document stores such as MongoDB allow switching between
a schema-free and a schema-fixed mode of operation, by registering
a JSON Schema [
        <xref ref-type="bibr" rid="ref11 ref4">4, 11</xref>
        ] declaration. Apart from solutions for isolated
tasks, such as extracting a schema declaration from persisted
documents, or validating documents against this schema, there are tools
that combine these steps into comprehensive end-to-end schema
management workflows (e.g. Hackolade [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] or Darwin [
        <xref ref-type="bibr" rid="ref12 ref16">12, 16</xref>
        ]).
      </p>
      <p>
        To this family of software products, we contribute a new
prototype called Josch [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ], where we enhance schema
management workflows by integrating novel tools for checking JSON
Schema containment. In interaction with Josch, we identify new
research challenges for both practitioners and theoreticians working
on search, exploration, and analysis in heterogeneous datastores.
      </p>
    </sec>
    <sec id="sec-3">
      <title>WORKFLOWS</title>
      <p>
        Our application scenario showcases a DevOps team who started
application development and production operations with a MongoDB
backend in schema-free mode. For data quality assurance, the team
at one point decides to register a JSON Schema declaration with its
MongoDB backend, so all writes are validated against this schema.
      </p>
      <p>
        ∗An extended version of this work has been presented as a demo at ICDE 2021 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
Copyright © 2021 for the individual papers by the papers’ authors. Copyright © 2021
for the volume as a collection by its editors. This volume and its papers are published
under the Creative Commons License Attribution 4.0 International (CC BY 4.0).
Published in the Proceedings of the 2nd Workshop on Search, Exploration, and
Analysis in Heterogeneous Datastores, co-located with VLDB 2021 (August 16-20, 2021,
Copenhagen, Denmark) on CEUR-WS.org.
      </p>
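      <p>
        As a minimal sketch of how registering a schema with MongoDB might look in the scenario above, the following Python snippet builds a collMod command that attaches a $jsonSchema validator to a collection. The collection name publications, the schema fields, and the helper build_collmod_command are hypothetical and chosen for illustration only. The snippet merely constructs the command document and does not contact a database.
      </p>
      <preformat>
```python
import json

# Hypothetical schema for a "publications" collection; the collection name
# and the fields are assumptions made for illustration only.
publication_schema = {
    "bsonType": "object",
    "required": ["title", "month"],
    "properties": {
        "title": {"bsonType": "string"},
        "month": {"bsonType": ["int", "string"]},
    },
}


def build_collmod_command(collection, schema):
    """Build a MongoDB collMod command that registers a $jsonSchema validator."""
    return {
        "collMod": collection,
        "validator": {"$jsonSchema": schema},
        "validationLevel": "strict",   # apply the rules to all inserts and updates
        "validationAction": "error",   # reject writes that violate the schema
    }


command = build_collmod_command("publications", publication_schema)
print(json.dumps(command, indent=2))
```
      </preformat>
      <p>
        With a driver such as PyMongo, a command document of this shape would typically be handed to the database object's command method. This is an assumption about the deployment, not a description of how Josch registers schemas.
      </p>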
      <p>
        Schema extraction &amp; validation. The DevOps team first has to
extract a schema declaration from the persisted data [
        <xref ref-type="bibr" rid="ref13 ref14 ref15 ref2 ref9">2, 9, 13–15</xref>
        ].
Often, schema extraction algorithms rely on sampling to cope with
large data volumes. Consequently, the extracted schema may not
faithfully describe the entire data instance. To avoid
validation errors at runtime, the entire data instance needs to be validated
against the extracted schema, which impacts database performance.
      </p>
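      <p>
        The effect of sampling can be illustrated with a small, self-contained Python sketch. It is not the extraction algorithm of any cited tool: the helpers json_type, extract_schema, and is_valid are hypothetical, the schema is restricted to the type keyword on top-level properties, and the collection is synthetic. The sample is deliberately chosen so that it misses the single outlier document.
      </p>
      <preformat>
```python
def json_type(value):
    """Map a parsed JSON value to a (coarse) JSON Schema type name."""
    if value is None:
        return "null"
    if isinstance(value, bool):   # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    return "object"


def extract_schema(documents):
    """Infer a flat schema recording the observed types of each property."""
    observed = {}
    for doc in documents:
        for key, value in doc.items():
            observed.setdefault(key, set()).add(json_type(value))
    properties = {key: {"type": sorted(types)} for key, types in sorted(observed.items())}
    return {"type": "object", "properties": properties}


def is_valid(doc, schema):
    """Validate a document against the flat schema (only 'type' is checked)."""
    for key, subschema in schema["properties"].items():
        if key in doc and json_type(doc[key]) not in subschema["type"]:
            return False
    return True


# Synthetic collection: numeric months, plus one outlier with a string month.
collection = [{"title": f"Paper {i}", "month": i % 12 + 1} for i in range(1000)]
collection.append({"title": "Outlier", "month": "August"})

sample = collection[:50]            # a sample that happens to miss the outlier
schema = extract_schema(sample)

# Only a scan over the entire collection reveals the violating document.
invalid = [doc for doc in collection if not is_valid(doc, schema)]
print(schema["properties"]["month"])
print(len(invalid), "document(s) violate the extracted schema")
```
      </preformat>
      <p>
        The full scan in the last step is what makes validation of the entire data instance costly on large collections.
      </p>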
      <p>Schema refactoring &amp; containment checking. When the schema
is edited, e.g., adjusted to account for outlier documents or
restructured for better readability, the team risks unintentionally
changing the schema semantics. In JSON Schema containment
checking, two JSON Schema declarations are compared based on
their semantics. Thus, we can automatically decide whether the
schema semantics has changed.</p>
      <p>For illustration, let us consider two excerpts of JSON Schema
documents that describe the month of a publication, schema 1: {"type":
["number","string"]} and schema 2: {"type": ["number"]}. Schema 2
is contained in schema 1, and is therefore more restrictive, as it requires
the month to be numeric, whereas schema 1 also allows a string.</p>
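      <p>
        For schemas that use only the type keyword, as in the two excerpts above, containment reduces to a subset test on the admitted types. The following Python sketch shows this special case. The helpers allowed_types, is_contained, and is_equivalent are hypothetical, and the sketch is not the algorithm used by the containment checkers that Josch integrates, which have to handle the full JSON Schema language.
      </p>
      <preformat>
```python
def allowed_types(schema):
    """Return the set of JSON types admitted by a schema that only uses 'type'."""
    declared = schema["type"]
    types = {declared} if isinstance(declared, str) else set(declared)
    # In JSON Schema, every integer is also a number.
    if "number" in types:
        types.add("integer")
    return types


def is_contained(sub, sup):
    """True if every value valid against `sub` is also valid against `sup`."""
    return allowed_types(sub).issubset(allowed_types(sup))


def is_equivalent(left, right):
    """True if both schemas admit exactly the same values."""
    return is_contained(left, right) and is_contained(right, left)


schema_1 = {"type": ["number", "string"]}
schema_2 = {"type": ["number"]}

print(is_contained(schema_2, schema_1))   # schema 2 is the more restrictive one
print(is_contained(schema_1, schema_2))
print(is_equivalent(schema_1, schema_2))
```
      </preformat>
      <p>
        An equivalence test of this kind is what allows a refactoring step to be checked: if the edited schema is equivalent to the original, the semantics has been preserved.
      </p>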
    </sec>
    <sec id="sec-4">
      <title>RESEARCH CHALLENGES</title>
      <p>
        We refer to our extended version [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] of this paper for a more
detailed discussion of related work. The full workflow just outlined
is supported by our software prototype Josch [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ], where Josch
is geared to (but not limited to) MongoDB, and employs the
third-party tools jsonsubschema [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and is-json-schema-subset [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] for
JSON Schema containment checking.
      </p>
      <p>
        State-of-the-art JSON Schema containment checkers do not
provide any explanation as to why two schemas differ. As a form of
explainability, we may resort to generating a witness document [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ],
i.e., a JSON document that is valid w.r.t. one schema but not the
other. At the moment, this is still a young research field.
      </p>
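      <p>
        To make the notion of a witness concrete, the following Python sketch searches for witnesses between two schemas that use only the type keyword. It is a toy under stated assumptions and not the witness generation approach of the cited tool: the names REPRESENTATIVES, json_types, is_valid, and find_witnesses are hypothetical, and one representative value per type suffices only because no other keywords are supported.
      </p>
      <preformat>
```python
# One representative JSON value per type class (an assumption for illustration).
REPRESENTATIVES = [None, True, 1, 1.5, "a", [], {}]


def json_types(value):
    """Return all JSON Schema type names that a parsed JSON value belongs to."""
    if value is None:
        return {"null"}
    if isinstance(value, bool):   # check bool first: bool is a subclass of int
        return {"boolean"}
    if isinstance(value, int):
        return {"integer", "number"}
    if isinstance(value, float):
        return {"integer", "number"} if value.is_integer() else {"number"}
    if isinstance(value, str):
        return {"string"}
    if isinstance(value, list):
        return {"array"}
    return {"object"}


def is_valid(value, schema):
    """Validate a value against a schema that only uses the 'type' keyword."""
    declared = schema["type"]
    admitted = {declared} if isinstance(declared, str) else set(declared)
    return not json_types(value).isdisjoint(admitted)


def find_witnesses(schema_a, schema_b):
    """Return representative values valid against schema_a but not schema_b."""
    return [
        value for value in REPRESENTATIVES
        if is_valid(value, schema_a) and not is_valid(value, schema_b)
    ]


schema_1 = {"type": ["number", "string"]}
schema_2 = {"type": ["number"]}

print(find_witnesses(schema_1, schema_2))   # values accepted by 1 but not by 2
print(find_witnesses(schema_2, schema_1))   # empty: schema 2 is contained in 1
```
      </preformat>
      <p>
        A non-empty result explains why containment fails, whereas an empty result in this restricted setting means that the first schema is contained in the second.
      </p>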
      <p>
        Further limitations of current JSON Schema containment
checkers concern negation and recursive references [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. While negation is
rarely used in real-world schemas, it can lead to complex schemas [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>The extracted schemas tend to be simplistic, yet highly verbose.
A semi-automated refactoring that extracts repeating structures and
introduces references to them could alleviate these
shortcomings. Yet both schema refactoring and
the extraction of complex schemas are open research challenges.</p>
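      <p>
        The idea of introducing references for repeating structures can be sketched as follows. This Python snippet is a simplified illustration, not a feature of Josch: the helpers canonical, count_subschemas, rewrite, and introduce_refs are hypothetical, only object-typed subschemas nested under properties are considered, any existing definitions entry would be overwritten, and the generated names def0, def1, and so on are placeholders that a human would rename in a semi-automated workflow.
      </p>
      <preformat>
```python
import json
from collections import Counter


def canonical(schema):
    """Serialize a schema deterministically so equal structures compare equal."""
    return json.dumps(schema, sort_keys=True)


def count_subschemas(schema, counts):
    """Count every object-typed subschema nested under 'properties'."""
    for sub in schema.get("properties", {}).values():
        if sub.get("type") == "object":
            counts[canonical(sub)] += 1
        count_subschemas(sub, counts)


def rewrite(schema, names):
    """Copy the schema, replacing each repeated subschema by a $ref."""
    result = dict(schema)
    if "properties" in schema:
        result["properties"] = {}
        for prop, sub in schema["properties"].items():
            key = canonical(sub)
            if key in names:
                result["properties"][prop] = {"$ref": "#/definitions/" + names[key]}
            else:
                result["properties"][prop] = rewrite(sub, names)
    return result


def introduce_refs(schema):
    """Move subschemas that occur at least twice into 'definitions'."""
    counts = Counter()
    count_subschemas(schema, counts)
    repeated = sorted(key for key, n in counts.items() if n >= 2)
    names = {key: f"def{i}" for i, key in enumerate(repeated)}
    result = rewrite(schema, names)
    if names:
        result["definitions"] = {name: json.loads(key) for key, name in names.items()}
    return result


# Hypothetical verbose schema with the same structure repeated twice.
person = {"type": "object", "properties": {"name": {"type": "string"}}}
verbose = {
    "type": "object",
    "properties": {"author": person, "editor": dict(person), "title": {"type": "string"}},
}

refactored = introduce_refs(verbose)
print(json.dumps(refactored, indent=2))
```
      </preformat>
      <p>
        Since such a rewrite must not alter the schema semantics, its result is a natural input for a containment check against the original schema.
      </p>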
    </sec>
    <sec id="sec-5">
      <title>OUTLOOK</title>
      <p>
        Solutions to the challenges outlined would also find application
beyond NoSQL schema management, e.g., in the static validation
of machine learning pipelines, as in the IBM LALE project [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
    </sec>
    <sec id="sec-6">
      <title>ACKNOWLEDGMENTS</title>
      <p>
        We thank Mohamed-Amine Baazizi, Dario Colazzo, Giorgio Ghelli,
and Carlo Sartiani for sharing their insights on JSON Schema, Uta
Störl for her comments on our full version of this paper, and the
authors of [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] for assistance in using their tool. We thank Pascal
Desmarets for providing us with an academic Hackolade license,
as well as for his feedback from the practitioners’ point of view.
      </p>
      <p>This project was supported by the Deutsche
Forschungsgemeinschaft (DFG, German Research Foundation), grant #385808805.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Lyes</given-names>
            <surname>Attouche</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mohamed-Amine</given-names>
            <surname>Baazizi</surname>
          </string-name>
          , Dario Colazzo, Francesco Falleni, Giorgio Ghelli, Cristiano Landi, Carlo Sartiani, and
          <string-name>
            <given-names>Stefanie</given-names>
            <surname>Scherzinger</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>A Tool for JSON Schema Witness Generation</article-title>
          .
          <source>In Proc. EDBT</source>
          .
          <fpage>694</fpage>
          -
          <lpage>697</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Mohamed-Amine</given-names>
            <surname>Baazizi</surname>
          </string-name>
          , Dario Colazzo, Giorgio Ghelli, and
          <string-name>
            <given-names>Carlo</given-names>
            <surname>Sartiani</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Parametric schema inference for massive JSON datasets</article-title>
          .
          <source>VLDB J</source>
          .
          <volume>28</volume>
          ,
          <issue>4</issue>
          (
          <year>2019</year>
          ),
          <fpage>497</fpage>
          -
          <lpage>521</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Mohamed-Amine</given-names>
            <surname>Baazizi</surname>
          </string-name>
          , Dario Colazzo, Giorgio Ghelli, Carlo Sartiani, and
          <string-name>
            <given-names>Stefanie</given-names>
            <surname>Scherzinger</surname>
          </string-name>
          . in-press.
          <article-title>An Empirical Study on the "Usage of Not" in Real-World JSON Schema Documents</article-title>
          .
          <source>In Proc. ER</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Pierre</given-names>
            <surname>Bourhis</surname>
          </string-name>
          , Juan L. Reutter, Fernando Suárez, and
          <string-name>
            <given-names>Domagoj</given-names>
            <surname>Vrgoc</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>JSON: Data model, Query languages and Schema specification</article-title>
          .
          <source>In Proc. PODS</source>
          .
          <fpage>123</fpage>
          -
          <lpage>135</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Michael</given-names>
            <surname>Fruth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mohamed-Amine</given-names>
            <surname>Baazizi</surname>
          </string-name>
          , Dario Colazzo, Giorgio Ghelli, Carlo Sartiani, and
          <string-name>
            <given-names>Stefanie</given-names>
            <surname>Scherzinger</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Challenges in Checking JSON Schema Containment over Evolving Real-World Schemas</article-title>
          .
          <source>In Proc. EmpER</source>
          .
          <fpage>220</fpage>
          -
          <lpage>230</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Michael</given-names>
            <surname>Fruth</surname>
          </string-name>
          , Kai Dauberschmidt, and
          <string-name>
            <given-names>Stefanie</given-names>
            <surname>Scherzinger</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>Josch: Managing Schemas for NoSQL Document Stores</article-title>
          .
          <source>In Proc. ICDE</source>
          .
          <fpage>2693</fpage>
          -
          <lpage>2696</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Michael</given-names>
            <surname>Fruth</surname>
          </string-name>
          , Kai Dauberschmidt, and
          <string-name>
            <given-names>Stefanie</given-names>
            <surname>Scherzinger</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>sdbs-unip/josch: Josch Version 1.0.0</article-title>
          . https://doi.org/10.5281/zenodo.5155117
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Habib</surname>
          </string-name>
          , Avraham Shinnar,
          <string-name>
            <given-names>Martin</given-names>
            <surname>Hirzel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Michael</given-names>
            <surname>Pradel</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>Finding Data Compatibility Bugs with JSON Subschema Checking</article-title>
          .
          <source>In ISSTA</source>
          .
          <fpage>620</fpage>
          -
          <lpage>632</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Hackolade</surname>
          </string-name>
          . online. Hackolade. https://hackolade.com
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10] haggholm. online.
          <article-title>is-json-schema-subset</article-title>
          . https://github.com/haggholm/is-json-schema-subset version 1.1.24.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>JSON Schema</surname>
          </string-name>
          . online. JSON Schema. https://json-schema.org
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Meike</given-names>
            <surname>Klettke</surname>
          </string-name>
          , Hannes Awolin, Uta Störl, Daniel Müller, and
          <string-name>
            <given-names>Stefanie</given-names>
            <surname>Scherzinger</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Uncovering the Evolution History of Data Lakes</article-title>
          .
          <source>In Proc. Big Data</source>
          .
          <fpage>2462</fpage>
          -
          <lpage>2471</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Meike</given-names>
            <surname>Klettke</surname>
          </string-name>
          , Uta Störl, and
          <string-name>
            <given-names>Stefanie</given-names>
            <surname>Scherzinger</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Schema Extraction and Structural Outlier Detection for JSON-based NoSQL Data Stores</article-title>
          .
          <source>In Proc. BTW</source>
          .
          <fpage>425</fpage>
          -
          <lpage>444</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14] Diego Sevilla Ruiz, Severino Feliciano Morales, and Jesús García Molina.
          <year>2015</year>
          .
          <article-title>Inferring Versioned Schemas from NoSQL Databases and its Applications</article-title>
          .
          <source>In Proc. ER</source>
          .
          <fpage>467</fpage>
          -
          <lpage>480</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>William</given-names>
            <surname>Spoth</surname>
          </string-name>
          , Oliver Kennedy, Ying Lu, Beda Christoph Hammerschmidt, and Zhen Hua Liu.
          <year>2021</year>
          .
          <article-title>Reducing Ambiguity in Json Schema Discovery</article-title>
          .
          <source>In Proc. SIGMOD</source>
          .
          <fpage>1732</fpage>
          -
          <lpage>1744</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Uta</given-names>
            <surname>Störl</surname>
          </string-name>
          , Daniel Müller, Alexander Tekleab, Stephane Tolale, Julian Stenzel, Meike Klettke, and
          <string-name>
            <given-names>Stefanie</given-names>
            <surname>Scherzinger</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Curating Variational Data in Application Development</article-title>
          .
          <source>In Proc. ICDE</source>
          .
          <fpage>1605</fpage>
          -
          <lpage>1608</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>