<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">New Workflows in NoSQL Schema Management</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Michael</forename><surname>Fruth</surname></persName>
							<email>michael.fruth@uni-passau.de</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Passau</orgName>
								<address>
									<settlement>Passau</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Kai</forename><surname>Dauberschmidt</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">University of Passau</orgName>
								<address>
									<settlement>Passau</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Stefanie</forename><surname>Scherzinger</surname></persName>
							<email>stefanie.scherzinger@uni-passau.de</email>
							<affiliation key="aff2">
								<orgName type="institution">University of Passau</orgName>
								<address>
									<settlement>Passau</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">New Workflows in NoSQL Schema Management</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">C489FA4A468CA13A53E8965F61B3D1CD</idno>
					<idno type="DOI">10.5281/zenodo.5155117</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T08:55+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Many NoSQL document stores allow for flexibility w.r.t. schema management: for instance, MongoDB allows switching between a schema-free and a schema-fixed mode of operation. For declaring such schemas, the JSON Schema language has become highly popular. We introduce the prototype software Josch, first demoed at ICDE 2021, which enhances the NoSQL schema management workflow by integrating novel tools for checking JSON Schema containment. We point out new research challenges in this context.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">OVERVIEW</head><p>NoSQL document stores such as MongoDB allow switching between a schema-free and a schema-fixed mode of operation by registering a JSON Schema <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b10">11]</ref> declaration. Apart from solutions for isolated tasks, such as extracting a schema declaration from persisted documents or validating documents against this schema, there are tools that combine these steps into comprehensive end-to-end schema management workflows (e.g., Hackolade <ref type="bibr" target="#b8">[9]</ref> or Darwin <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b15">16]</ref>).</p><p>To this family of software products, we contribute a new prototype called Josch <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b6">7]</ref>, which enhances schema management workflows by integrating novel tools for checking JSON Schema containment. In interaction with Josch, we identify new research challenges for both practitioners and theoreticians working on search, exploration, and analysis in heterogeneous datastores.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">WORKFLOWS</head><p>Our application scenario features a DevOps team that started application development and production operations with a MongoDB backend in schema-free mode. For data quality assurance, the team at some point decides to register a JSON Schema declaration with its MongoDB backend, so that all writes are validated against this schema.</p><p>Schema extraction &amp; validation. The DevOps team first has to extract a schema declaration from the persisted data <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b8">9,</ref><ref type="bibr" target="#b12">13,</ref><ref type="bibr" target="#b13">14,</ref><ref type="bibr" target="#b14">15]</ref>. Schema extraction algorithms often rely on sampling to cope with large data volumes. Consequently, the extracted schema may not faithfully describe the entire data instance. To avoid validation errors at runtime, the entire data instance needs to be validated against the extracted schema, which puts additional load on the database.</p><p>Schema refactoring &amp; containment checking. When the schema is edited, e.g., to account for outlier documents or to restructure it for better readability, the team risks unintentionally changing the schema semantics. JSON Schema containment checking compares two JSON Schema declarations based on their semantics. Thus, we can automatically decide whether the schema semantics has changed.</p><p>For illustration, consider two excerpts of JSON Schema documents that describe the month of a publication, 𝑆1: {"type": ["number","string"]} and 𝑆2: {"type": ["number"]}. Schema 𝑆2 is contained in 𝑆1 and is therefore more restrictive: it requires the month to be numeric, whereas 𝑆1 also allows a string.</p></div>
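<div xmlns="http://www.tei-c.org/ns/1.0"><p>The containment relation between the two month schemas can be made concrete with a minimal Python sketch. It is a toy that inspects only the "type" keyword and is not the algorithm used by jsonsubschema or is-json-schema-subset; the helper names allowed_types and is_contained are hypothetical and introduced purely for illustration.</p>

```python
# Toy containment check restricted to the "type" keyword (an assumption of
# this sketch); real checkers must handle the full JSON Schema vocabulary.

ALL_TYPES = frozenset(
    {"null", "boolean", "integer", "number", "string", "array", "object"}
)


def allowed_types(schema: dict) -> frozenset:
    """Return the set of JSON types that the schema's "type" keyword admits."""
    declared = schema.get("type")
    if declared is None:
        # Without a "type" keyword, every type is admitted.
        return ALL_TYPES
    if isinstance(declared, str):
        declared = [declared]
    types = set(declared)
    if "number" in types:
        # In JSON Schema, every integer is also a number.
        types.add("integer")
    return frozenset(types)


def is_contained(sub: dict, sup: dict) -> bool:
    """True if every type admitted by `sub` is also admitted by `sup`."""
    return allowed_types(sub).issubset(allowed_types(sup))


S1 = {"type": ["number", "string"]}
S2 = {"type": ["number"]}

print(is_contained(S2, S1))  # True: S2 is contained in S1
print(is_contained(S1, S2))  # False: S1 also admits strings
```

<p>Comparing a schema before and after an edit in both directions indicates, under the assumptions of this sketch, whether the edit preserved the semantics (containment holds both ways) or made the schema strictly more or less restrictive.</p></div>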
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">RESEARCH CHALLENGES</head><p>We refer to our extended version <ref type="bibr" target="#b5">[6]</ref> of this paper for a more detailed discussion of related work. The full workflow just outlined is supported by our software prototype Josch <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b6">7]</ref>. Josch is geared to (but not limited to) MongoDB and employs the third-party tools jsonsubschema <ref type="bibr" target="#b7">[8]</ref> and is-json-schema-subset <ref type="bibr" target="#b9">[10]</ref> for JSON Schema containment checking.</p><p>State-of-the-art JSON Schema containment checkers do not explain why two schemas differ. As a form of explainability, we may resort to generating a witness document <ref type="bibr" target="#b0">[1]</ref>, i.e., a JSON document that is valid w.r.t. one schema but not the other. Witness generation is still a young research field.</p><p>Further limitations of current JSON Schema containment checkers concern negation and recursive references <ref type="bibr" target="#b4">[5]</ref>. While negation is rarely used in real-world schemas, it can lead to complex schemas <ref type="bibr" target="#b2">[3]</ref>.</p><p>Extracted schemas tend to be simplistic, yet highly verbose. A semi-automated refactoring that extracts repeating structures and introduces references to them could alleviate these shortcomings. Yet both schema refactoring and the extraction of complex schemas are open research challenges.</p></div>
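<div xmlns="http://www.tei-c.org/ns/1.0"><p>To illustrate what a witness document is, the following Python sketch searches a small, fixed list of candidate values for one that is valid w.r.t. one schema but not the other. It is a brute-force toy that understands only the "type" keyword and is not the technique of the cited witness generation tool; the names CANDIDATES, json_type_names, is_valid, and find_witness are hypothetical.</p>

```python
# Toy witness search for schemas that use only the "type" keyword (an
# assumption of this sketch). A witness is a JSON value that is valid
# w.r.t. the first schema but not w.r.t. the second.

# Hand-picked candidate values, one or two per JSON type.
CANDIDATES = [None, True, 1, 1.5, "May", [], {}]


def json_type_names(value) -> set:
    """Return the JSON Schema type names that apply to a Python value."""
    if value is None:
        return {"null"}
    if isinstance(value, bool):  # test bool before int: bool subclasses int
        return {"boolean"}
    if isinstance(value, int):
        return {"integer", "number"}
    if isinstance(value, float):
        return {"number"}
    if isinstance(value, str):
        return {"string"}
    if isinstance(value, list):
        return {"array"}
    return {"object"}


def is_valid(value, schema: dict) -> bool:
    """Validate a value against the "type" keyword only."""
    declared = schema.get("type")
    if declared is None:
        return True
    if isinstance(declared, str):
        declared = [declared]
    return not json_type_names(value).isdisjoint(declared)


def find_witness(s_a: dict, s_b: dict):
    """Return (True, value) if a candidate is valid for s_a but not for s_b."""
    for value in CANDIDATES:
        if is_valid(value, s_a) and not is_valid(value, s_b):
            return True, value
    return False, None


S1 = {"type": ["number", "string"]}
S2 = {"type": ["number"]}

print(find_witness(S1, S2))  # (True, 'May')
print(find_witness(S2, S1))  # (False, None)
```

<p>The result is returned as a pair because null is itself a legitimate JSON witness. A failed search over a finite candidate list does not prove containment; it only means that none of the tried values separates the two schemas.</p></div>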
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">OUTLOOK</head><p>Solutions to the challenges outlined would also find application beyond NoSQL schema management, e.g., in the static validation of machine learning pipelines, as in the IBM LALE project <ref type="bibr" target="#b7">[8]</ref>.</p></div>		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>ACKNOWLEDGMENTS</head><p>We thank Mohamed-Amine Baazizi, Dario Colazzo, Giorgio Ghelli, and Carlo Sartiani for sharing their insights on JSON Schema, Uta Störl for her comments on our full version of this paper, and the authors of <ref type="bibr" target="#b7">[8]</ref> for assistance in using their tool. We thank Pascal Desmarets for providing us with an academic Hackolade license, as well as for his feedback from a practitioner's point of view.</p><p>This project was supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), grant #385808805.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">A Tool for JSON Schema Witness Generation</title>
		<author>
			<persName><forename type="first">Lyes</forename><surname>Attouche</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mohamed-Amine</forename><surname>Baazizi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dario</forename><surname>Colazzo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Francesco</forename><surname>Falleni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Giorgio</forename><surname>Ghelli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Cristiano</forename><surname>Landi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Carlo</forename><surname>Sartiani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stefanie</forename><surname>Scherzinger</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. EDBT</title>
				<meeting>EDBT</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="694" to="697" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Parametric schema inference for massive JSON datasets</title>
		<author>
			<persName><forename type="first">Mohamed-Amine</forename><surname>Baazizi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dario</forename><surname>Colazzo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Giorgio</forename><surname>Ghelli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Carlo</forename><surname>Sartiani</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">VLDB J</title>
		<imprint>
			<biblScope unit="volume">28</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="497" to="521" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">An Empirical Study on the &quot;Usage of Not&quot; in Real-World JSON Schema Documents</title>
		<author>
			<persName><forename type="first">Mohamed-Amine</forename><surname>Baazizi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dario</forename><surname>Colazzo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Giorgio</forename><surname>Ghelli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Carlo</forename><surname>Sartiani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stefanie</forename><surname>Scherzinger</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. ER</title>
				<meeting>ER</meeting>
		<imprint/>
	</monogr>
	<note>In press</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">JSON: Data model, Query languages and Schema specification</title>
		<author>
			<persName><forename type="first">Pierre</forename><surname>Bourhis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Juan</forename><forename type="middle">L</forename><surname>Reutter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Fernando</forename><surname>Suárez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Domagoj</forename><surname>Vrgoc</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. PODS</title>
				<meeting>PODS</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="123" to="135" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Challenges in Checking JSON Schema Containment over Evolving Real-World Schemas</title>
		<author>
			<persName><forename type="first">Michael</forename><surname>Fruth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mohamed-Amine</forename><surname>Baazizi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dario</forename><surname>Colazzo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Giorgio</forename><surname>Ghelli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Carlo</forename><surname>Sartiani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stefanie</forename><surname>Scherzinger</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. EmpER</title>
				<meeting>EmpER</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="220" to="230" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Josch: Managing Schemas for NoSQL Document Stores</title>
		<author>
			<persName><forename type="first">Michael</forename><surname>Fruth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kai</forename><surname>Dauberschmidt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stefanie</forename><surname>Scherzinger</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. ICDE</title>
				<meeting>ICDE</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="2693" to="2696" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">sdbs-uni-p/josch</title>
		<author>
			<persName><forename type="first">Michael</forename><surname>Fruth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kai</forename><surname>Dauberschmidt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stefanie</forename><surname>Scherzinger</surname></persName>
		</author>
		<idno type="DOI">10.5281/zenodo.5155117</idno>
		<ptr target="https://doi.org/10.5281/zenodo.5155117" />
	</analytic>
	<monogr>
		<title level="m">Josch Version 1.0</title>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Finding Data Compatibility Bugs with JSON Subschema Checking</title>
		<author>
			<persName><forename type="first">Andrew</forename><surname>Habib</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Avraham</forename><surname>Shinnar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Martin</forename><surname>Hirzel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Michael</forename><surname>Pradel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ISSTA</title>
		<imprint>
			<biblScope unit="page" from="620" to="632" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<author>
			<persName><surname>Hackolade</surname></persName>
		</author>
		<ptr target="https://hackolade.com" />
		<title level="m">Hackolade</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<author>
			<persName><surname>Haggholm</surname></persName>
		</author>
		<ptr target="https://github.com/haggholm/is-json-schema-subset" />
		<title level="m">is-json-schema-subset, version 1.1.24</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<ptr target="https://json-schema.org" />
		<title level="m">JSON Schema</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Uncovering the Evolution History of Data Lakes</title>
		<author>
			<persName><forename type="first">Meike</forename><surname>Klettke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hannes</forename><surname>Awolin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Uta</forename><surname>Störl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Daniel</forename><surname>Müller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stefanie</forename><surname>Scherzinger</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. Big Data</title>
				<meeting>Big Data</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="2462" to="2471" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Schema Extraction and Structural Outlier Detection for JSON-based NoSQL Data Stores</title>
		<author>
			<persName><forename type="first">Meike</forename><surname>Klettke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Uta</forename><surname>Störl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stefanie</forename><surname>Scherzinger</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. BTW</title>
				<meeting>BTW</meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="425" to="444" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Inferring Versioned Schemas from NoSQL Databases and its Applications</title>
		<author>
			<persName><forename type="first">Diego</forename><surname>Sevilla Ruiz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Severino</forename><surname>Feliciano Morales</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jesús</forename><surname>García Molina</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. ER</title>
				<meeting>ER</meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="467" to="480" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Reducing Ambiguity in Json Schema Discovery</title>
		<author>
			<persName><forename type="first">William</forename><surname>Spoth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Oliver</forename><surname>Kennedy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ying</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Beda</forename><forename type="middle">Christoph</forename><surname>Hammerschmidt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Zhen Hua</forename><surname>Liu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. SIGMOD</title>
				<meeting>SIGMOD</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="1732" to="1744" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Curating Variational Data in Application Development</title>
		<author>
			<persName><forename type="first">Uta</forename><surname>Störl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Daniel</forename><surname>Müller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alexander</forename><surname>Tekleab</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stephane</forename><surname>Tolale</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Julian</forename><surname>Stenzel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Meike</forename><surname>Klettke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stefanie</forename><surname>Scherzinger</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. ICDE</title>
				<meeting>ICDE</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="1605" to="1608" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
