MMSBench-Net: Scenario-Based Evaluation of Multi-Model Database Systems

David Lengweiler, Marco Vogt and Heiko Schuldt
University of Basel, Department of Mathematics and Computer Science, Spiegelgasse 1, 4051 Basel, Switzerland
david.lengweiler@unibas.ch (D. Lengweiler); marco.vogt@unibas.ch (M. Vogt); heiko.schuldt@unibas.ch (H. Schuldt)
ORCID: 0009-0004-0588-8210 (D. Lengweiler); 0000-0002-2674-2219 (M. Vogt); 0000-0001-9865-6371 (H. Schuldt)

34th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), June 7-9, 2023, Hirsau, Germany.
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

Abstract
Multi-model database systems have gained increasing popularity due to their efficient management of diverse types of data and support for complex queries. They offer a unified approach for managing data in various formats, including structured, semi-structured, and unstructured data. However, benchmarking the performance of such systems is a challenging task, given their complexity, mainly due to their support for multiple data models. While significant research exists for benchmarking single-model databases, a comprehensive approach for evaluating multi-model databases is still in an early stage. To address this challenge, we propose MMSBench-Net, a benchmark for evaluating multi-model database systems that support structured relational, semi-structured document, and graph data models. MMSBench-Net enables comparative analysis of database systems and demonstrates how different workloads can reveal the strengths and weaknesses of multi-model database systems. To demonstrate the effectiveness of the benchmark, we compare the performance of two database systems: Polypheny and SurrealDB. Our work is a first step towards a comprehensive evaluation methodology for multi-model database systems.

Keywords
Database benchmark, Polystore, Multi-model database

1. Introduction

The field of data management has experienced a significant transformation in recent years. While relational databases continue to dominate the market, more specialized systems have emerged. Two data models that have gained substantial popularity are the graph and the document model. These data models allow data to be represented and queried differently from the relational model [1]. However, these new data models are by no means an evolution of the relational model. As a result, use cases that could be modeled optimally with the relational model might only be modeled poorly with the graph or the document model. Consequently, database management systems supporting multiple data models have gained popularity. These multi-model database systems allow applications to manage their data in a way that best suits the specific domains, but they also introduce greater complexity. While there are well-established benchmarks like TPC-C [2], TPC-H [3] and YCSB [4] for single-model databases, the set of benchmarks targeting multi-model databases is very limited. Existing benchmarks for multi-model databases often focus on specific data models, which restricts the range of systems that can be evaluated. Moreover, these benchmarks typically involve complex scenarios that lack fine-grained workload adjustments, limiting their usefulness for detailed evaluations and only allowing for broad comparisons.

This paper makes two contributions: Firstly, we propose a benchmark called MMSBench-Net that is tailored to multi-model database systems and based on a real-world scenario that deals with relational, document and graph data. Secondly, we demonstrate the utility of our benchmark by comparing the performance of two multi-model database systems, Polypheny (https://polypheny.com/) and SurrealDB (https://surrealdb.com/), and discuss the results.

The remainder of this paper is structured as follows: In Section 2, we introduce MMSBench-Net, discuss the underlying scenario and present the data and workload that is being generated. In Section 3, we then briefly introduce the two multi-model database systems subject to the benchmark evaluation presented in this paper. Section 4 then presents and discusses the obtained results. The paper concludes with an overview of related work in Section 5, an outlook towards future work in Section 6 and a conclusion in Section 7.

2. Benchmark

To evaluate the performance of multi-model database systems, we propose MMSBench-Net, a benchmark that assesses their ability to manage structured relational, semi-structured document, and graph data models. MMSBench-Net is designed to evaluate the efficiency and versatility of multi-model database systems under different workloads. The "-Net" suffix refers to the first scenario introduced in this paper. We plan to add more scenarios (and thus suffixes) in the future, leading to a complete suite.

Figure 1: Multi-Model Schema of MMSBench-Net (document part: semi-structured device and connection logs; graph part: device nodes and connection edges with additional properties; relational part: User table with primary key id, firstname, lastname, birthday, salary, and Login table with deviceId, userId, accesstime, duration, successful)

The MMSBench-Net benchmark consists of a set of queries that reflect real-world use cases across the three data models. These queries are designed to evaluate various aspects of multi-model database systems, including their ability to handle complex data structures, support complex queries, and efficiently execute transactions.

2.1. Scenario

MMSBench-Net is inspired by a real-world scenario of a company's network monitoring application. Network monitoring plays a vital role in identifying and addressing potential issues, threats and vulnerabilities in the network infrastructure, ensuring smooth operations and preventing data breaches or downtime. The monitoring application continuously collects all kinds of information about the network, including logged-in devices, usage statistics and log messages, resulting in huge amounts of heterogeneous data.

The network monitoring application modeled by MMSBench-Net maintains data in three data models: (i) a graph part, modeling the topology of the network, (ii) a document part, which consists of semi-structured logs produced by the devices, and (iii) a relational part, which holds basic information about the users and recorded data about their access patterns. The complete schema is depicted in Figure 1.

The topology of the network is saved as a graph, where each device in the network is represented as a node, and a network connection between two devices is modeled as an edge. For both the devices modeled as nodes and the connections modeled as edges, additional information is stored, such as the "manufacturer" of a device, its "purchase year" and other relevant information.

In irregular intervals, each device produces a semi-structured log entry containing information about its current state. An example of such a log can be seen in Figure 2. These logs might contain error information, indicating problems with the device. All log entries include the properties deviceId, timestamp and users. However, additional properties with varying levels of nesting are randomly generated for each log entry.

    {
      deviceId: 3,
      timestamp: "2017-07-23:14-03",
      error: {
        message: "Out of Memory",
        type: "Application Error",
        stacktrace: ["Error on start of..."]
      },
      user: {
        id: 34,
        status: "logged in"
      },
      users: [34, 45]
    }

Figure 2: Example Status Log Showing an Error

An important piece of information for monitoring a network is which person is currently associated with which devices. For this scenario, we assume a rather simple user database represented as a relational table containing information on the employees. Furthermore, there is also a table for recording successful and failed login attempts and for accounting the usage of devices. Hence, this scenario necessitates the database system to deal with heterogeneous read and write workloads.

2.2. Schema and Data Generation

To generate the schema and to populate it with realistic but artificially created data, MMSBench-Net starts with building a simulation. This simulation includes the graph representing the network that is being monitored, as well as the users interacting with it. The nodes in the graph represent devices (e.g., computers, mobile phones, switches, and routers). The edges between these nodes represent network connections between these devices. The simulation utilizes the defined topology to generate meaningful workloads. By making changes to this topology, it becomes possible to adjust the distribution of available targets for the queries. This enables an easy alignment with the specific requirements and desired focus of a workload.
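This kind of simulated topology can be sketched in a few lines of Python. This is an illustrative sketch only, not the reference implementation; the parameter names (num_users, device_ranges) and property values are assumptions:

```python
import random

# Sketch: generate a simulated network topology from configurable parameters.
# Using a fixed seed makes the generated network reproducible.
def generate_network(seed=42, num_users=10, device_ranges=None):
    rng = random.Random(seed)  # same seed => identical network every time
    device_ranges = device_ranges or {
        "computer": (20, 40), "switch": (5, 10),
        "router": (2, 5), "phone": (10, 20),
    }
    users = [{"id": i, "name": f"user{i}"} for i in range(num_users)]
    devices, connections = [], []
    # a random number of devices (within a configurable range) per type
    for dtype, (lo, hi) in device_ranges.items():
        for _ in range(rng.randint(lo, hi)):
            devices.append({"id": len(devices), "type": dtype,
                            "manufacturer": rng.choice(["A", "B", "C"])})
    # connect each end device to a randomly chosen switch or router
    infra = [d for d in devices if d["type"] in ("switch", "router")]
    for d in devices:
        if d["type"] not in ("switch", "router"):
            connections.append({"from": d["id"], "to": rng.choice(infra)["id"]})
    return users, devices, connections
```

Changing the ranges or adding device types shifts the distribution of available query targets, which is exactly the knob the benchmark exposes.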
The process of generating this simulated network consists of multiple steps:

User Generation: First, a configurable number of users is generated.

Generation of Devices: For each type of device (e.g., switches, computers), a random number (within a configurable range) of devices is generated.

Device Properties and Logs Generation: For each device, a random set of properties is generated. Furthermore, a set of login logs as well as status logs is added.

Generation of Connections: According to the layout of the network, multiple pairs of devices are selected and connections between them are created.

Connection Properties Generation: In contrast to the devices, connections do not have status logs, but they also carry multiple dynamic properties.

After the generation of the network is done, it is used as a template to create the workload. Each workload consists of queries in a query language supported by the system under evaluation. A distinction is made between the three data models. First, the graph data is handled as already seen in Figure 1: each device is represented as a node and each connection is translated to an edge which connects them. The small set of dynamic properties is inserted directly as part of these nodes and edges (if properties are not supported by the data model, they are handled as if they were unstructured data). Then, all generated device logs (an example can be seen in Figure 2) are translated to document queries. Each entity of type device translates its nested status logs into multiple document queries, each containing a timestamp and the ID of the device. Finally, all login records are collected from the devices and, together with the user data itself, are translated into relational queries. The collected queries are then sequentially executed on the database systems.

2.3. Workload Generation

A workload consists of a collection of randomly chosen queries according to a configurable distribution. Since the order in which queries are executed can impact the performance of a database system (e.g., due to concurrency effects and locking), the implementation needs to make sure that the workload is identical for all systems under evaluation (e.g., by using the same seed). MMSBench-Net uses a variety of queries to build its workloads:

Read Device or Connection: Selects a device or connection and retrieves it partially or fully. One of the static parameters is chosen for this.

Read Log: Selects a device and reads all or parts of its logs. Filters as well as projections of underlying keys are chosen from the target device.

Remove Device: Selects a device and deletes it; all connections to this device are deleted as well. Logs are also deleted, while information on log-in attempts is kept.

Remove Connection: Randomly selects a connection between two network devices and deletes it.

Add Device: Adds a device to the network. Generates new connections to existing devices.

Remove Logs: Randomly selects a device and deletes some of its logs.

Add Logs: Creates a random log message and adds it to existing devices or connections.

Add User: Creates a new user. All attributes are randomly generated.

Remove User: Randomly selects a user who is then deleted.

Change User: Randomly selects a user and adjusts an attribute.

Besides these simple queries, there are also more complex retrieval operations which can be chosen; their frequency is also configurable:

Connectivity Checks: "Find all similar connected devices" or "Find connected devices of a specific type"

Error Analysis: "Identify the top 10 most common errors" or "Calculate the percentage of errors caused by each user"

Login Activity: "Successful logins by user and month" or "Average duration of successful logins by user and hour of the day"
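The seeded, distribution-driven selection described in this section can be illustrated with a short Python sketch. The query names and weights below are hypothetical placeholders; the actual distributions are configurable in the benchmark:

```python
import random

# Hypothetical query-type distribution; the benchmark makes these
# frequencies configurable per workload.
QUERY_DISTRIBUTION = {
    "read_device": 0.25, "read_log": 0.25, "add_logs": 0.20,
    "add_device": 0.10, "remove_logs": 0.10, "connectivity_check": 0.10,
}

def generate_workload(size, seed, distribution=QUERY_DISTRIBUTION):
    # A fixed seed guarantees that every system under evaluation
    # receives the identical query sequence in the identical order.
    rng = random.Random(seed)
    kinds = list(distribution)
    weights = list(distribution.values())
    return rng.choices(kinds, weights=weights, k=size)
```

Because query order can affect performance (locking, concurrency), reusing the same seed across systems is what makes the resulting measurements comparable.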
Firstly, the actions are selected and implemented on the simulated network, while concurrently being captured and converted into queries for the evaluated systems. Once the simulation concludes, the gathered queries are distributed across a configurable number of available threads and executed on the evaluated system. The execution time for each query is measured individually and recorded for subsequent analysis. This facilitates a comprehensive analysis of various aspects of the database systems. Each iteration of this workload generation and execution process is referred to as a cycle; in an evaluation, multiple cycles can be chained together to construct more extensive workloads.

3. Evaluated Systems

To showcase the capabilities of MMSBench-Net, two multi-model databases have been chosen for evaluation: Polypheny and SurrealDB. These two systems have been selected since they follow completely opposite approaches for implementing multiple data models beneath one facade. While Polypheny maintains the individual models independently, SurrealDB follows a more monolithic approach by combining all data models in one unified model.

3.1. Polypheny

Polypheny [5, 6] is a PolyDBMS [7], i.e., a multi-model database system built according to the architectural principle of a polystore and supporting multiple query languages. Data can be represented according to the relational, the document and the labeled-property graph data models. Polypheny utilizes multiple highly optimized database systems like HyperSQL (https://hsqldb.org/), MongoDB, Neo4j, and PostgreSQL as storage and execution engines. To achieve competitive performance, Polypheny uses these underlying data stores to push down queries. Queries not supported by the underlying data store are executed within Polypheny itself. Polypheny also provides support for transactions with ACID guarantees.

3.2. SurrealDB

SurrealDB is a multi-model database management system that provides traditional database guarantees, such as ACID transactions, persistent data storage, and fine-grained data access control. Its primary objective is to provide fast performance while adhering to these guarantees. It also supports unstructured data and basic graph functionality, which makes it a suitable choice for this comparison. SurrealDB was designed with the goal of reducing the number of joins required for retrieval queries. It accomplishes this objective by utilizing a graph structure that allows a tuple to reference any other tuple. SurrealQL, a SQL-like query language, is the primary means of interacting with the system, which can be accessed through either a REST or a WebSocket interface.

4. Evaluation

Our evaluation uses Chronos [8], an "evaluation-as-a-service" framework which allows to easily execute different system evaluations and configurations in parallel. To achieve this, it manages a collection of nodes, which are used to execute these different evaluation configurations. The evaluation machines used for obtaining the results presented in this paper are equipped with an Intel Xeon X5650 24-core CPU and 24 GiB of RAM. All machines run Ubuntu 22.04 LTS (with kernel version 5.15.0-37) and the same patch level. As Java runtime environment, we use OpenJDK version 17.0.3. The presented numbers are the median over three runs.

Each run uses either a SurrealDB instance in a Docker (https://www.docker.com/) container, deployed from scratch and configured to use a persistent on-file configuration, or a fresh Polypheny instance. The Polypheny instance uses a MongoDB (https://www.mongodb.com/) store for the document data, a Neo4j (https://neo4j.com/) store for the graph data, and a PostgreSQL (https://www.postgresql.org/) store for the relational data. Each of these stores is deployed by Polypheny using Docker containers; this requires less setup than bare-metal deployments and achieves similar performance [9]. Both Polypheny and SurrealDB have indexes on their primary keys. We provide a reference implementation of the benchmark, including all configurations and the raw results (https://download-dbis.dmi.unibas.ch/paper/GvDB23.zip).

As a first overview comparison, the default configuration of the benchmark, simulating a network with 10 users and around 65 devices, is used. All scaling parameters are configured to only allow for a slight growth of the network. The different runtimes after multiple cycles of workloads can be seen in Figure 3.

Figure 3: Runtime with Increasing Number of Cycles (total runtime, log scale, ns)

With such a small network and thus a low number of queries, SurrealDB manages to execute the workloads faster than Polypheny, even when the number of queries increases. If one observes the results grouped by query model, Polypheny is faster than SurrealDB for the relational queries; this can be seen in Figure 4.

Figure 4: Mean Relational Query Runtime with Increasing Number of Cycles (log scale, ns)

However, in most real-world scenarios, the network starts with a significantly higher number than the 10 users used by default. Thus, the number of users is adjusted, leading to a higher number of user logins and therefore more relational workload. With an increasing ratio of relational workload, Polypheny is able to perform similarly to SurrealDB. This behavior is similar if the number of devices in the network is increased. While this does not increase the ratio of the relational workload compared to the other data models, it still results in better overall performance of Polypheny, which is depicted in Figure 5. Figure 6 depicts a comparison of different ratios of complex queries in the workload.

Figure 5: Runtime with Increasing Number of Devices (total runtime, log scale, ns)

Figure 6: Runtime for Increasing Ratio of Complex Queries (total runtime, log scale, ns)

The results obtained from the evaluation of the two quite different systems confirm the concepts of the MMSBench-Net benchmark, in particular that it is agnostic to the concrete database under evaluation and has wide applicability for the evaluation of single- and multi-model database systems in realistic settings.
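The measurement methodology of this section — timing every query individually and reporting the median over runs — can be sketched as follows. The function names are illustrative and not taken from the reference implementation:

```python
import statistics
import time

def timed_execute(execute, queries):
    # Measure each query's execution time individually (in ns)
    # and record it for subsequent analysis.
    timings = []
    for query in queries:
        start = time.perf_counter_ns()
        execute(query)
        timings.append(time.perf_counter_ns() - start)
    return timings

def median_total_runtime(runs):
    # One total runtime per run; the reported number is the median
    # over the runs (three runs in the evaluation above).
    return statistics.median(sum(run) for run in runs)
```

Keeping per-query timings, rather than only per-cycle totals, is what allows results to be regrouped afterwards, e.g., by query model as in Figure 4.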
5. Related Work

One of the first prominent benchmarks for evaluating database management systems was the Wisconsin [10] benchmark, introduced in 1983. The space of multi-model database evaluation, in contrast, has a rather short history, one of the first entries being BigBench [11], introduced by TPC as TPCx-BB. BigBench uses a schema which combines structured, semi-structured and unstructured data. But besides TPC, there has been an increasing amount of work providing benchmarks similar to the one proposed in this paper. In [12], a synthetically generated benchmark using key-value, column, document and graph data is proposed and used to compare ArangoDB (https://www.arangodb.com/) and OrientDB (https://orientdb.org/) against a combination of single-model databases. The authors were able to show that, depending on the scenario, multi-model databases can be faster than configurations combining multiple single-model database systems. UniBench [13] targets the same data models as MMSBench-Net, but also considers key-value and XML data. It puts great effort into modeling an as realistic as possible social-commerce scenario. M2Bench [14] relies heavily on existing benchmark datasets and extends the data models used by UniBench by introducing the array model into its evaluation.

6. Future Work

Our goal for MMSBench-Net is to extend it into a benchmarking suite that offers various real-world usage scenarios for multi-model data management. However, there are some limitations of MMSBench-Net that we need to address in the future.

First, the minimal set of queries that we have chosen for evaluation may not be representative of all possible ways to query multi-model systems. In future evaluations, we should include a more diverse set of queries to reflect the range of possibilities when querying these systems. This would provide more comprehensive results and strengthen the obtained conclusions.

Second, the current composition of workloads is too broad and general to allow for nuanced comparisons of multi-model systems. We need to create more fine-grained workloads that focus on specific aspects of data models to capture the subtle differences between these systems.

In addition to the limitations of the benchmark, our evaluation only compared two systems, leaving a lot of unexplored territory. Future evaluations should include additional systems such as ArangoDB and OrientDB to gain more insights into their performance. Although not all multi-model databases support the same data models, it is possible to use parts of unsupported data models or substitute them with other models to expand the range of systems that can be evaluated.

Lastly, we should consider evaluating configurations that use a combination of multiple single-model databases to facilitate interesting comparisons. By addressing these limitations, we can develop a more comprehensive and nuanced benchmarking suite that offers a more accurate evaluation of multi-model systems.

7. Conclusion

In this paper, we introduced MMSBench-Net, a new benchmark specifically tailored to benchmarking multi-model database systems that is based on the scenario of a network monitoring application. Our evaluation of Polypheny and SurrealDB demonstrates the effectiveness and applicability of the proposed benchmark.

Our research represents an important first step towards establishing a comprehensive evaluation methodology for multi-model database systems. The proposed benchmark allows for a fair comparison of different systems, and our results provide insights into the performance of Polypheny and SurrealDB under different workloads. Ultimately, this benchmark will guide the development and evaluation of novel multi-model database systems.

Acknowledgments

This work was partly supported by the SNSF ("Polypheny-DDI: A Flexible Polystore-based Distributed Data Infrastructure", grant no. 200020_213121). The authors would like to thank R. Arnold, R. Gasser, S. Heller, L. Sauter, F. Spiess and A. Mbilinyi for their valuable feedback.

References

[1] E. F. Codd, A relational model of data for large shared data banks, Communications of the ACM 13 (1970) 377–387. doi:10/dwxst4.
[2] Transaction Processing Performance Council, TPC Benchmark C, revision 5.11, 2010. URL: https://tpc.org/tpc_documents_current_versions/pdf/tpc-c_v5.11.0.pdf.
[3] Transaction Processing Performance Council, TPC Benchmark H, standard revision 3.0.1, 2022. URL: https://tpc.org/tpc_documents_current_versions/pdf/tpc-h_v3.0.1.pdf.
[4] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, R. Sears, Benchmarking cloud serving systems with YCSB, in: Proc. SoCC'10, ACM Press, 2010, pp. 143–154. doi:10/cxjrfd.
[5] M. Vogt, Adaptive Management of Multimodel Data and Heterogeneous Workloads, Ph.D. thesis, University of Basel, 2022. doi:10/j44k.
[6] M. Vogt, N. Hansen, J. Schönholz, D. Lengweiler, I. Geissmann, S. Philipp, A. Stiemer, H. Schuldt, Polypheny-DB: Towards bridging the gap between polystores and HTAP systems, in: Proc. Poly'21, LNCS, Springer, 2020, pp. 25–36. doi:10/gnxv2h.
[7] M. Vogt, D. Lengweiler, I. Geissmann, N. Hansen, M. Hennemann, C. Mendelin, S. Philipp, H. Schuldt, Polystore systems and DBMSs: Love marriage or marriage of convenience?, in: Proc. Poly'21, volume 12921 of LNCS, Springer, 2021, pp. 65–69. doi:10/gn8qvm.
[8] M. Vogt, A. Stiemer, S. Coray, H. Schuldt, Chronos: The swiss army knife for database evaluations, in: Proc. EDBT'20, OpenProceedings.org, 2020, pp. 583–586. doi:10/g8w5.
[9] W. Felter, A. Ferreira, R. Rajamony, J. Rubio, An updated performance comparison of virtual machines and Linux containers, in: Proc. ISPASS'15, 2015, pp. 171–172. doi:10/gfvg6d.
[10] H. Boral, D. J. DeWitt, A methodology for database system performance evaluation, in: Proc. SIGMOD'84, ACM, 1984, pp. 176–185. doi:10/fk5fbn.
[11] C. Baru, M. Bhandarkar, C. Curino, et al., Discussion of BigBench: A proposed industry standard performance benchmark for big data, in: Performance Characterization and Benchmarking. Traditional to Big Data, Springer, 2015, pp. 44–63. doi:10/j44q.
[12] F. R. Oliveira, L. del Val Cura, Performance evaluation of NoSQL multi-model data stores in polyglot persistence applications, in: Proc. IDEAS'16, ACM, 2016, pp. 230–235. doi:10/j44n.
[13] C. Zhang, J. Lu, P. Xu, Y. Chen, UniBench: A benchmark for multi-model database management systems, in: Performance Evaluation and Benchmarking for the Era of Artificial Intelligence, Springer, 2019, pp. 7–23. doi:10/j44m.
[14] B. Kim, K. Koo, U. Enkhbat, S. Kim, J. Kim, B. Moon, M2Bench: A database benchmark for multi-model analytic workloads, Proceedings of the VLDB Endowment 16 (2022) 747–759. doi:10/j44p.