=Paper=
{{Paper
|id=Vol-3250/messpaper2
|storemode=property
|title=Key-Value vs Graph-based data lakes for
realizing Digital Twin systems (Poster) 
|pdfUrl=https://ceur-ws.org/Vol-3250/messpaper2.pdf
|volume=Vol-3250
|authors=Daniel Pérez-Porras,Paula Muñoz,Javier Troya,Antonio Vallecillo
|dblpUrl=https://dblp.org/rec/conf/staf/Perez-Porras0TV22
}}
==Key-Value vs Graph-based data lakes for realizing Digital Twin systems (Poster)==
<pdf width="1500px">https://ceur-ws.org/Vol-3250/messpaper2.pdf</pdf>
<pre>
Key-Value vs Graph-based data lakes for realizing
Digital Twin systems (Poster)
Daniel Pérez-Porras, Paula Muñoz, Javier Troya and Antonio Vallecillo
ITIS Software. University of Málaga, Spain

Keywords
Digital twins, Data Lake, NoSQL databases, Graph databases


1. Introduction
A Digital Twin (DT) is a comprehensive digital representation of an actual system,
service or product (the Physical Twin, PT), synchronized at a specified frequency and
fidelity [1]. The digital twin includes the properties, condition and behavior of the
physical entity through models and data, and is continuously updated with real-time
data about the PT performance, maintenance, and health status throughout its entire
lifetime [2]. The exchange of data between the digital and the physical twins takes place
through bi-directional data connections. Additionally, a DT system can also comprise a
set of services that permit exploiting the data exchanged by the two twins [3].
   Engineering DT systems is challenging for many reasons, one of them being their
complexity [4]. The problem we would like to address in this paper is how to implement
the connections between the twins in an effective and efficient way. Usually, these
connections are achieved through a Data Lake. As defined in [5], a data lake is “a flexible,
scalable data storage and management system, which ingests and stores raw data from
heterogeneous sources in their original format, and provides query processing and data
analytics in an on-the-fly manner.”
   In a previous work [6] we defined a framework for the specification and deployment
of DT systems. It uses UML models to specify the digital twins, and connects them
through a Data Lake repository implemented in Redis (https://redis.io/), which provides
the bi-directional communication infrastructure. This open-source, lightweight, in-memory
data structure store is optimized to deliver fast responses to a massive number of requests.
Redis is a key-value database and supports various abstract data structures such as
strings, lists, sets or maps (called ‘hashes’). However, Redis does not easily allow complex
queries: to retrieve hashes by the values of their fields, it is necessary to store additional
records that include the field value and a reference to the hash key. This makes the
database structure and contents dependent on the queries that need to be performed.
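
The indexing pattern described above can be sketched as follows. This is a minimal
illustration, not the implementation from [6]: plain Python dictionaries stand in for
Redis hashes and sets so that the example runs without a server, and the key names
(snapshot:1, idx:twin:car1) and field names (twin, speed) are hypothetical.

```python
from collections import defaultdict

# Assumption: dictionaries emulate a key-value store; no Redis server is used.
hashes = {}                  # key -> dict of fields, playing the role of a hash
index = defaultdict(set)     # index key -> set of hash keys, playing the role of a set


def store_snapshot(key, fields, indexed_fields):
    """Store a hash plus one extra index record per field that will be queried."""
    hashes[key] = dict(fields)
    for name in indexed_fields:
        index[f"idx:{name}:{fields[name]}"].add(key)


def find_by_field(name, value):
    """Retrieve hashes by field value: one index read, then one read per hit."""
    keys = sorted(index.get(f"idx:{name}:{value}", set()))
    return [hashes[k] for k in keys]


# Hypothetical snapshots; only the "twin" field is indexed when they are written.
store_snapshot("snapshot:1", {"twin": "car1", "speed": "10"}, ["twin"])
store_snapshot("snapshot:2", {"twin": "car2", "speed": "10"}, ["twin"])
store_snapshot("snapshot:3", {"twin": "car1", "speed": "25"}, ["twin"])

by_twin = find_by_field("twin", "car1")    # works: an index record exists
by_speed = find_by_field("speed", "10")    # finds nothing: "speed" was never indexed
```

In this sketch the query on speed returns nothing even though two snapshots match,
because no index records were written for that field. Supporting it would mean adding
new records for every stored hash, which is the dependency on the queries noted above.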
MESS@STAF 2022: International workshop on MDE for Smart IoT Systems, July 04–08, 2022, Nantes,
France
Email: daniperezporras@uma.es (D. Pérez-Porras); paulam@uma.es (P. Muñoz); jtroya@uma.es (J. Troya);
av@uma.es (A. Vallecillo)
ORCID: 0000-0003-2939-5803 (P. Muñoz); 0000-0002-1314-9694 (J. Troya); 0000-0002-8139-9986 (A. Vallecillo)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0
International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073
   Alternative solutions use time series databases, such as InfluxDB or TimescaleDB, and
even define temporal models for handling them at a higher level of abstraction [7]. While
very efficient for querying time-sensitive information, these solutions are not optimal for
implementing more general queries, needed when efficient data analysis is required.
   In this work we explore the use of Graph databases [8] to store and query the information
handled in data lakes. Graph databases use graph structures to perform semantic queries;
they store the information as nodes, edges, and properties. Examples of Graph databases
include Neo4j, ArangoDB or OrientDB, to name a few. They all come with specialized
query languages, such as Gremlin, Cypher, SPARQL, or GraphQL.
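
To make the node, edge and property model concrete, the following sketch builds a tiny
in-memory property graph and matches a pattern over it. It is only an illustration
under assumed names: the labels Twin and Snapshot, the relationship type PRODUCED and
the Graph class are hypothetical, and a real graph database would index and optimize
such queries rather than scan every edge.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    props: dict = field(default_factory=dict)


@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)   # node id -> Node
    edges: list = field(default_factory=list)   # (source id, type, target id)

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = Node(label, props)

    def add_edge(self, source, rel_type, target):
        self.edges.append((source, rel_type, target))

    def match(self, source_label, rel_type, target_label, where):
        """Return (source, target) node pairs that follow the given pattern."""
        pairs = []
        for source, kind, target in self.edges:
            a, b = self.nodes[source], self.nodes[target]
            if (kind == rel_type and a.label == source_label
                    and b.label == target_label and where(a, b)):
                pairs.append((a, b))
        return pairs


# Hypothetical data: two twins and the snapshots they produced.
g = Graph()
g.add_node("t1", "Twin", name="car1")
g.add_node("t2", "Twin", name="car2")
g.add_node("s1", "Snapshot", speed=10)
g.add_node("s2", "Snapshot", speed=25)
g.add_node("s3", "Snapshot", speed=30)
g.add_edge("t1", "PRODUCED", "s1")
g.add_edge("t1", "PRODUCED", "s2")
g.add_edge("t2", "PRODUCED", "s3")

# Roughly the pattern that a Cypher query could express as:
# MATCH (t:Twin)-[:PRODUCED]->(s:Snapshot) WHERE s.speed > 20 RETURN t.name, s.speed
fast = [(t.props["name"], s.props["speed"])
        for t, s in g.match("Twin", "PRODUCED", "Snapshot",
                            lambda t, s: s.props["speed"] > 20)]
```

The filter on speed is applied at query time, so no extra records have to be stored in
advance for it.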
   We have implemented a data lake using Neo4j, and evaluated its performance and
expressiveness against our previous implementation with Redis. As expected, Neo4j allows
easy specification of more complex queries using its Cypher query language. The same
queries in Redis require the addition of ad-hoc records with the corresponding key-value
mappings if the queries were not contemplated beforehand in the Redis record structure.
Interestingly, the additional queries forced by these new records introduce a performance
penalty that makes Neo4j’s response times better than those of Redis. Furthermore, the
response times obtained for simple queries in Redis are not very different from those
of Neo4j. All this makes Neo4j appear to be a better solution than Redis for realizing
data lakes. The description of the tests carried out and their results are available from
https://github.com/atenearesearchgroup/dt-graph-database. As future work, we are defining
a benchmark with different types of queries that will be used to compare implementations
of data lakes using different technologies, including time series databases, too. We hope
that our evaluation will help to shed some light on the advantages and limitations of each
solution, and to identify situations where one type of solution outperforms the other.

References
[1] Digital Twin Consortium, Glossary of digital twins,
    https://www.digitaltwinconsortium.org/glossary/index.htm, 2021.
[2] F. Bordeleau, B. Combemale, R. Eramo, M. van den Brand, M. Wimmer, Towards
    model-driven digital twin engineering: Current opportunities and future challenges,
    in: Proc. of ICSMM’20, volume 1262 of CCIS, Springer, 2020, pp. 43–54.
[3] F. Tao, H. Zhang, A. Liu, A. Y. C. Nee, Digital twin in industry: State-of-the-art,
    IEEE Trans. Ind. Informatics 15 (2019) 2405–2415. doi:10.1109/TII.2018.2873186.
[4] M. Grieves, J. Vickers, Digital Twin: Mitigating Unpredictable, Undesirable Emergent
    Behavior in Complex Systems, Springer, 2017, pp. 85–113.
[5] R. Hai, C. Quix, M. Jarke, Data lake concept and systems: a survey, CoRR
    abs/2106.09592 (2021). arXiv:2106.09592.
[6] P. Muñoz, J. Troya, A. Vallecillo, Using UML and OCL Models to Realize High-Level
    Digital Twins, in: Proc. of ModDiT2021@MODELS’21, IEEE, 2021, pp. 212–220.
[7] A. Mazak, S. Wolny, A. Gómez, J. Cabot, M. Wimmer, G. Kappel, Temporal models
    on time series databases, J. Object Technol. 19 (2020) 3:1–15.
[8] I. Robinson, J. Webber, E. Eifrem, Graph Databases: New Opportunities for
    Connected Data, 2 ed., O’Reilly Media, Inc., 2015.

</pre>