Impact of read and write operations on NoSQL
                                schema design: First insights
                                André Conrad1,* , Uta Störl1
                                1
                                    University of Hagen, Universitätsstraße 47, 58097 Hagen, Germany


                                                                         Abstract
                                                                         Document stores like MongoDB are among the most popular NoSQL data stores. Due to the nested
                                                                         document structure and possible redundancy, a large amount of alternative modeling possibilities arise
                                                                         in contrast to relational data stores, even for simple applications. Therefore, designing an optimal
                                                                         schema is a complex task. In [1] we presented a visionary approach to automatic schema migration and
                                                                         optimization for migrating relational data stores to various NoSQL stores. This is essentially based on
                                                                         an in-depth data and workload analysis. A big challenge, however, is to find the optimal compromise
                                                                         between read and write optimization. In this paper, we show the impact of different workload profiles
                                                                         (mix of read and write operations) on the schema design using initial measurements. This illustrates the
                                                                         fundamental need to consider realistic workload profiles in the schema design and optimization process.

                                                                         Keywords
                                                                         Document Store, NoSQL Schema Design, Database Benchmark, Optimization


                                1. Introduction
                                NoSQL data stores such as MongoDB are gaining in popularity. However, designing a suitable
                                schema is a major challenge due to the large number of modeling alternatives, which is often
                                based on a trial and error approach. Thus, there are several works that address automatic
                                schema optimization and design, respectively [1, 2, 3, 4].
                                  For the evaluation of automatically generated, but also manually created schemas, it is
                                therefore absolutely necessary to use realistic workload profiles in order to be able to estimate
                                the behavior in a productive environment.
                                  Contribution: First measurements showing the different behavior of realistic workload profiles
                                compared to single execution (one by one) of queries.
                                This paper is organized as follows: Section 2 presents work discussing the evaluation of different
                                schema modeling and optimization approaches. Section 3 describes the environment and the
                                results of our first measurements. Section 4 provides the conclusion as well as an outlook to
                                our future work.


                                ER2023: Companion Proceedings of the 42nd International Conference on Conceptual Modeling: ER Forum, 7th SCME,
                                Project Exhibitions, Posters and Demos, and Doctoral Consortium, November 06–09, 2023, Lisbon, Portugal
                                *
                                  Corresponding author.
                                " andre.conrad@fernuni-hagen.de (A. Conrad); uta.stoerl@fernuni-hagen.de (U. Störl)
                                 0000-0001-6681-2798 (A. Conrad); 0000-0003-2771-142X (U. Störl)
                                                                       © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                                    CEUR
                                    Workshop
                                    Proceedings
                                                  http://ceur-ws.org
                                                  ISSN 1613-0073
                                                                       CEUR Workshop Proceedings (CEUR-WS.org)


CEUR
                  ceur-ws.org
Workshop      ISSN 1613-0073
Proceedings
2. Related work
In the following, related work on the evaluation of schema modeling and optimization ap-
proaches is described. Since the focus of this paper is on the evaluation and the necessity of
considering realistic workload profiles, only papers regarding the evaluation of NoSQL schema
design methods are considered.
    In [5] an e-commerce scenario is used. They define 6 different access operations. The workload
profile has been described in detail, but it is not clearly defined if a concurrent workload with
multiple threads (clients) is used. In [6] the Rice University Bidding System (RUBiS), a web
application benchmark, was adapted from [7]. It consists of a read-only workload with 7 queries
and a mixed workload with 9 additional write queries. Although the measurements were
originally performed with different numbers of threads (clients), this is not discussed in detail.
In [2] and [4], the RUBiS scenario is also used, but again without going into detail about the
measurements (number of threads) and only with read queries. The authors of [3] use the
e-commerce and RUBiS scenario for comparison with other work, but also without going into
detail about how measurements are made with respect to the different workload profiles and
number of threads (clients).
    None of the works describes in detail how the measurements were made with respect to the
workload. In particular, the number of concurrent threads (clients) is not discussed. Therefore,
it is difficult to understand exactly what the benchmark environment looks like.


3. Experimental environment and results
The goal of this work is to show the impact of executing queries one by one compared to different
workload mixes of read and write queries on multiple clients. This is done using two different
schema designs. One has been optimized for read-only queries and one for write-only queries.
   The schema of a prototypical application1 to manage digital print media is used because it
is a real world scenario that is not too complex but has sufficiently complex relationships to
illustrate the impact on the two schema designs (see Figure 1).
   The test data was generated using our test data generator [8]. Figure 1 shows the conceptual
model of the application.


                Author         Publisher        Item             MetaData            Page         Collection

           id            pk   id      pk   id          pk   id              pk   id         pk   id      pk
           first_name         name                          title                content         name
           last_name                                        release_date         size
           birthday


Figure 1: Conceptual model of the application (the optimized physical models are available here:
https://zenodo.org/doi/10.5281/zenodo.10009801).

  For modeling the physical schemas and defining different workload mixes, 3 read and 3 write
queries are used: (𝑅1 ) Given a Collection.id, return the Collection, related Items, Metadata,
1
    https://sammlungen.hebis.de/Handbuecher/
Publisher and Authors. (𝑅2 ) Given an Author.id, return the Author, related Items, Publisher
and Collections. (𝑅3 ) Given an Item.id, return the Item related Metadata, Pages, Publisher
and Authors. (𝑊4 ) Given an existing Publisher.id and multiple existing Collection.ids, add one
or more new Authors and a new Item with corresponding Metadata and Pages. (𝑊5 ) Given
multiple existing Item.ids and Collection.ids, add the Items to the Collections. (𝑊6 ) Given an
existing Item.id and changed Metadata.title, update the Metadata.
   Due to the various relationship modeling possibilities, designing the optimal schema for a
given workload in document stores such as MongoDB is a complex task.
   Here, optimizations with respect to reading queries result in redundancy in the Collection,
Item and Author documents so that all attributes needed for the queries are present in the
corresponding documents. This increases the complexity of the write queries (𝑊4 to 𝑊6 ) since
redundant data must be written in several collections. To avoid inconsistencies, transactions
were used for the write operations in the read-optimized schema.
   For the optimization of the write queries, redundancy must be avoided, since the performance
decreases the more redundant data has to be written. Here, only 𝑊5 is realized as a transaction,
since several documents are updated, which ensures a rollback in the case of an error.
   All measurements were done on a single node of MongoDB version 6.0.9 with 64 GB RAM and
16 CPU cores. For the benchmark, py-tpcc 2 , a Python implementation of the TPC-C3 benchmark
was extended, which already has a MongoDB driver as part of performance measurements on
multi-document transactions [9]. For the measurements regarding workload mixes different
probabilities of the queries are used (from 99% read and 1% write to 1% read and 99% write).
r90_w10, for example, means 90% read (𝑅1 , 𝑅2 or 𝑅3 ) and 10% write (𝑊4 , 𝑊5 or 𝑊6 ).
   Results: First, it can be observed that for sequential execution (single thread) of queries, the
read queries in the read-optimized schema and the write queries in the write-optimized schema
always have the best performance (see 𝑅3 and 𝑊6 one_by_one in Figure 2 as an example).
   However, this may be different for individual queries when running mixed workloads on
multiple concurrent clients. For example, query 𝑊6 (write) performs better in the read-optimized
schema than in the write-optimized schema for mixed workloads (see Figure 2b).
                       30,000                                                                       30,000
                                                                                   Queries / Min.
      Queries / Min.


                       20,000                                                                       20,000

                       10,000                                                                       10,000

                           0                                                                            0
                                     Read Optimized              Write Optimized                                  Read Optimized              Write Optimized

                                one_by_one   r90_w10   r50_w50    r10_w90                                    one_by_one   r90_w10   r50_w50    r10_w90

                                        (a) Query 𝑅3 (read).                                                        (b) Query 𝑊6 (write).
Figure 2: Execution one by one (single thread) and 3 workload profiles with 32 threads (clients).


   Finally, Figure 3 shows the behavior of different workload mixes on the read and write
optimized schemas. Here, as expected, the correspondingly optimized schemas show better per-
formance for a higher amount of read or write operations. For better readability the plot has been
split, since the performance is much higher in the read-optimized schema for workload r99_w1.

2
    https://github.com/apavlo/py-tpcc
3
    https://tpc.org/tpcc/
                   15,000                                  2,000

  Queries / Min.


                                          Queries / Min.
                                                           1,500
                   10,000
                                                           1,000
                    5,000
                                                            500
                        0                                     0
                       r99_w1   r80_w20                              r80_w20       r60_w40          r40_w60    r20_w80   r1_w99

                                                           Read Optimized Schema      Write Optimized Schema


Figure 3: Execution of different workload profiles (mixes) with 32 threads (clients).


4. Conclusion and future work
Based on initial measurements, it could be shown that when looking at the execution perfor-
mance of individual queries, significant differences exist between queries executed sequentially
(one by one) or, in a more realistic scenario, on several concurrent clients in a workload mix.
   This shows the importance to consider realistic workload profiles (mixes) with concurrent
clients when evaluating schema optimization approaches.
   There are many dimensions to be taken into account when executing database benchmarks.
We therefore plan to perform more extensive measurements. Our current vision is to use a
MongoDB cluster with replication and sharding. We also plan to further examine important
variables and their effects, such as the different number of clients, different document sizes, and
different number of related documents.


Acknowledgments
This work has been funded by Deutsche Forschungsgemeinschaft (DFG, German Research Foun-
dation) grant #385808805.


References
[1] A. Conrad, et al., Towards Automated Schema Optimization, in: ER’21, 2021.
[2] V. Reniers, et al., A Workload-Driven Document Database Schema Recommender (DBSR),
    in: ER’20, 2020.
[3] L. Chen, et al., A workload-driven method for designing aggregate-oriented NoSQL
    databases, DKE’22 (2022).
[4] M. Hewasinghage, et al., Automated database design for document stores with multicriteria
    optimization, KAIS’23 (2023).
[5] C. de Lima, et al., On proposing and evaluating a NoSQL document database logical
    approach, IJWIS’16 (2016).
[6] M. J. Mior, et al., NoSE: Schema design for NoSQL applications, in: ICDE’16, 2016.
[7] E. Cecchet, et al., Performance and scalability of EJB applications, in: OOPSLA’02, 2002.
[8] V. Restat, et al., GouDa - generation of universal data sets: improving analysis and evaluation
    of data preparation pipelines, in: DEEM’22, 2022.
[9] A. Kamsky, Adapting TPC-C Benchmark to Measure Performance of Multi-Document
    Transactions in MongoDB, VLDB’19 (2019).