NeoEMF: a Multi-database Model Persistence Framework for Very Large Models Gwendal Daniel1 , Gerson Sunyé1 , Amine Benelallam1 , Massimo Tisi1 , Yoann Vernageau1 , Abel Gómez2;4 , and Jordi Cabot3;4 1 AtlanMod Team Inria, Mines Nantes & Lina ffirst name.last nameg@inria.fr 2 Departamento de Informtica e Ingeniera de Sistemas Universidad de Zaragoza abel.gomez@unizar.es 3 ICREA jordi.cabot@icrea.cat 4 Internet Interdisciplinary Institute Universitat Oberta de Catalunya Abstract. The growing use of Model Driven Engineering (MDE) techniques in industry has emphasized scalability of existing model persistence solutions as a major issue. Specifically, there is a need to store, query, and transform very large models in an efficient way. Several persistence solutions based on relational and NoSQL databases have been proposed to achieve scalability. However, ex- isting solutions often rely on a single data store, which suits a specific modeling activity, but may not be optimized for other use cases. In this article we present N EO EMF, a multi-database model persistence framework able to store very large models in key-value stores, graph databases, and wide column databases. We in- troduce N EO EMF core features, and present the different data stores and their applications. N EO EMF is open source and available online. Keywords: Model Persistence, Scalability, Large Models 1 Introduction With the progressive adoption of MDE techniques in industry [10], existing model per- sistence solutions have to address scalability issues to store, query, and transform large and complex models [13]. Indeed, existing modeling frameworks were first designed to handle simple modeling activities, and often rely on XMI-based serialization to store models. While this format is a good fit for small models, it has shown clear limitations when scaling to large ones [11]. To overcome these limitations, several persistence frameworks based on relational and NoSQL databases have been proposed [5,7,11]. They rely on a lazy-loading mech- anism, which reduce memory consumption by loading only accessed objects. These solutions have proven their efficiency compared to state-of-the-art tools, but they are often tailored to a specific data-store implementation. In these approaches, the choice of the datastore is totally decoupled from the ex- pected model usage (for example complex querying, interactive editing, or complex model-to-model transformation): the persistence layer offers generic scalability im- provements, but is not optimized for a specific scenario. For example, a graph-based representation of a model can improve scalability by offering a lazy-loading mecha- nism, but will have poor execution time performance in scenarios involving repeated atomic value accesses. Our previous work on model persistence have shown that providing a well-suited data store for a specific modeling scenario can dramatically improve performance. Based on this observation, we present in this article N EO EMF, a scalable model per- sistence framework based on a modular architecture enabling model storage into mul- tiple data stores. Currently, N EO EMF provides three implementations-map, graph, and column–each one optimized for a specific usage scenario. N EO EMF provides two APIs, one strictly compatible to the Eclipse Modeling Framework (EMF) API, easing its in- tegration into existing modeling tools, and an advanced API that provides specific fea- tures that bypass the standard EMF API to further improve scalability of particular modeling scenarios. The rest of the paper is organized as follows: Section 2 presents an overview of the N EO EMF architecture, Section 3 and 4 present the core features of the framework and the different datastores. Section 5 provides insights on the framework’s implementation, and finally Section 6 summarizes the key points of the paper and presents our future work. Note that examples of N EO EMF usages are provided on N EO EMF’s wiki5 and in a demonstration video available online at https://youtu.be/_OyWpcOMOfA. It highlights N EO EMF’s core features such as model import, API usage, and the lazy model editor which allows to navigate interactively large models with a low memory footprint. The demonstration also present a concrete use case by showing the different steps needed to integrate N EO EMF into an existing EMF-based application, and use it to store and query models containing several million of elements. Finally, the demon- stration presents an overview of two tools developed on top of N EO EMF: the Mogwaı̈ query framework and ATL-MR, a distributed version of the ATL transformation engine. 2 Architecture Overview Figure 1 describes the integration of N EO EMF in the EMF ecosystem. Modelers typ- ically access a model using Model-based Tools, which provide high-level modeling features such as a graphical interface, interactive console, or query editor. These fea- tures internally rely on EMF’s Model Access API to navigate models, perform CRUD operations, check constraints, etc. In its core, EMF delegates the operations to a per- sistence manager using its Persistence API, which is in charge of the serialization/de- serialization of the model. The N EO EMF core component is defined at this level, and can be registered as a persistence manager for EMF, same as, for example, the default XMI persistence manager. This design makes N EO EMF both transparent to the client- application and EMF itself, that simply delegates calls without taking care of the actual storage. 5 https://github.com/atlanmod/NeoEMF/wiki Model-Based Tools Model Access API “Standard” EMF Modeling User Persistence API NeoEMF Core Caching /Graph /Map /Column Backend API “Advanced” User & Developer Blueprints MapDB HBase/ ZooKeeper Fig. 1: N EO EMF Integration in EMF Ecosystem Once the core component has received the modeling operation to perform, it for- wards the operation to the appropriate database driver (Map, Graph , or Column), which is in charge of handling the low-level representation of the model. These connec- tors translate modeling operations into Backend API calls, store the results, and reify database records into EMF EObjects when needed. N EO EMF also embeds a set of de- fault caching strategies that are used to improve performance of client applications, and can be configured transparently at the EMF API level. In addition to this transparent integration into existing EMF applications, N EO EMF provides a specific API, which targets advanced users / high-performance applications. This API provides utility methods which overcome EMF limitations, allow fine-grained tuning of the databases, and access to internal caches. By using this API, N EO EMF can be tuned to improve execution time and/or scalability of a specific modeling scenario. To provide this smooth integration into the EMF infrastructure, the N EO EMF core component redefines the behavior of several EMF classes. For instance, each N EO EMF driver defines a specific implementation of PersistenceBackendFactory that is responsible of the concrete data store creation. This factory creates an instance of the data store that corresponds to the Resource options. Once the data store has been created, the driver instantiates a specific implementation of the EStore interface–also depending on the Resource options–that translates the delegated method calls into data- store specific API calls. This architecture allows to change the underlying data store by simply updating the Resource options. The EStore also returns Persistent- EObject from the database when needed, using a specific reification mechanism. 3 Software Features As introduced in the previous Section, N EO EMF provides two API levels: one for a standard use of existing EMF applications / APIs, and one advanced that allows to bypass EMF’s limitations, tune internal data stores, and configure caches. In this Section we present first the standard features, available simply by plugging N EO EMF into an existing application, then we introduce its advanced features. 3.1 Standard Features An important characteristic of N EO EMF is its compliance with the EMF API. All class- es/interfaces extending existing EMF ones strictly define all their methods, and we put a special attention to ensure that calling a N EO EMF method produces the same behav- ior (including possible side effects) as standard EMF API calls. As a result, existing applications can integrate N EO EMF with a very small amount of efforts and bene- fit immediately from N EO EMF scalability improvements. Existing code manipulating regular EMF EObjects does not have to be modified, and will behave as expected. Specifically, N EO EMF supports the following EMF features: – Code generation: N EO EMF provides a dedicated code generator that transpar- ently extends the EMF one, and allows client applications to manipulate models using generated java classes. – Reflexive/Dynamic API: reflexive and dynamic EMF methods (eSet, eGet, eUnset, eDynamicGet, eDynamicSet ...) can be used on N EO EMF objects, and behave as their standard implementations. – Resource API: N EO EMF also implements the resource specific API, such as get- Contents, getAllContents, save, and load. In addition, N EO EMF takes advantage of the flexible save and load options to enable backend-specific cus- tomizations. As other model persistence solutions [5, 11], N EO EMF achieves scalability using a lazy-loading mechanism, which loads into memory objects only when they are ac- cessed, overcoming XMI’s limitations. Lazy-loading is defined at the core component: N EO EMF implementation of EObject consists of a simple wrapper delegating all its method calls to an EStore, that directly manipulates elements at the database level. Us- ing this technique, N EO EMF benefits from datastore optimizations (such as caches), and only maintains a small amount of elements in memory (the ones that have not been saved), reducing drastically the memory consumption of modeling applications. 3.2 Advanced Features In addition to its compliance with the EMF API, N EO EMF provides specific utility fea- tures to tackle EMF’s limitations, such as the List allInstances(EClass eClass ) method, which is accessible through the PersistentResource interface. This feature tack- les the problem of allInstances computation in EMF [14] by delegating it to the data store, allowing to retrieve requested element fastly, using data store indexes, or specific data representation. N EO EMF also includes an io module, providing a scalable Model Importer, that consists of an event-based XMI parser that bypasses the EMF API to efficiently store the model in a dedicated database with a low memory footprint. The importer is designed to be generic and can be implemented in each backend component. We also plan to add an efficient Model Exporter module that would allow to produce optimized model serializations from their database representation. Finally, NeoEMF contains a set of caching strategies that can be plugged on top of the data store according to specific needs. Note that these caches are available for all connectors, unless otherwise stated. – EStructuralFeaturesCaching: a LRU cache storing loaded objects by their ac- cessed feature. – IsSetCaching: a cache keeping the result of isSet calls to avoid multiple ac- cesses to the database. – SizeCaching: a cache keeping the result of size calls on multi-valued features to avoid multiple accesses to the database. – RecordCaches: a set of database-specific caches maintaining a list of records to improve execution time. These caches can be configured using the save and load Resource methods, which al- lows to add specific options which are then forwarded to the appropriate Persis- tenceBackendFactory. 4 Datastores The previous features are available for a variety of data stores supported by N EO EMF. In this section we introduce the different datastores available. We introduce briefly model representation in these stores and describe their differences and the specific mod- eling scenario they better address. Both, standard and advanced, features presented in the previous section are implemented in the supported datastores. 4.1 N EO EMF/M AP N EO EMF/M AP [7] has been designed to provide fast access to atomic operations, such as accessing a single element/attribute, and navigating a single reference. This imple- mentation is optimized for EMF API-based accesses, which typically generate atomic and fragmented calls on the model. N EO EMF/M AP embeds a key-value store, which maintains a set of in-memory/on disk maps to speed up model element accesses. The benchmarks performed in previous work [7] show that N EO EMF/M AP is the most suit- able solution to improve performance and scalability of EMF API-based tools that need to access very large models on a single machine. N EO EMF/M AP data model is composed of three different maps that store model information: (i) a property map, which keeps all objects data in a centralized place; (ii) a type map, which tracks how objects relate to the meta-level (such as the instance of relationships); and (iii) a containment map, which defines the model structure in terms of containment references. 4.2 N EO EMF/G RAPH N EO EMF/G RAPH [2] persists models in an embedded graph database that represents model elements as vertices, attributes as vertex properties, and references as edges. Metamodel elements are also persisted as vertices in the graph, and are linked to their instances through the INSTANCE_OF relationship. Using graphs to store models allows N EO EMF to benefit from the rich traversal features that graph databases usually provide, such as fast shortest-path computation, or efficient complex navigation paths among several vertices/edges. These advanced query capabilities have been used to develop the Mogwaı̈ [4] tool, that maps OCL expressions to graph navigation traversals. On the other hand, graph databases are not well-suited to compute atomic accesses of single elements or attributes, which are typical queries computed in interactive model edition. 4.3 N EO EMF/C OLUMN N EO EMF/C OLUMN [6] has been designed to enable the development of distributed MDE-based applications by relying on a distributed column-based datastore. N EO EM- F/C OLUMN uses a single table with three column families to store model information: (i) a property column family that keeps all objects data stored together; (ii) a type col- umn family that tracks how objects relate to the meta-level (such as the instance of relationships); and (iii) a containment column family that defines the model structure in terms of containment hierarchy. In contrast with Map and Graph implementations, N EO EMF/C OLUMN offers con- current read/write capabilities and guarantees ACID properties at model element level. It exploits the wide availability of distributed clusters in order to distribute intensive read/write workloads across datanodes. The distributed nature of this persistence so- lution is used in the ATL-MR [3] tool, a distributed engine for model transformations in the ATL language on top of MapReduce. N EO EMF/C OLUMN, enables the cluster’s nodes to share read/write rights over the same set of input/output models. 5 Implementation N EO EMF has been implemented as a set of open source Eclipse plugins distributed under the EPL license. The N EO EMF website6 presents an overview of the key features and current ongoing work, and the source code repository is fully available on GitHub (https://github.com/atlanmod/NeoEMF). N EO EMF has been released as part of the MONDO platform [8]. The N EO EMF/G RAPH implementation relies on Blueprints [12], an interface de- signed to unify graph databases under a common API. Blueprints has been implemented by a large number of databases, such as Neo4j, OrientDB, and Titan. The use of this abstraction layer on top of graph databases enable client applications to use the graph implementation of their choice, as long as it implements the Blueprints API. For now, N EO EMF/G RAPH embeds Blueprints 2.5.0 and provides a convenience wrapper for Neo4j 1.9.6. An implementation relying on the new Blueprints API (called Tinkerpop3) is under study for now, as well as the creation of additional database wrappers. N EO EMF/M AP embeds the key-value store MapDB 1.0.9. MapDB provides Maps, Sets, Lists, Queues and other collections backed by off-heap or on-disk storage, and describes itself as a hybrid between Java Collections and an embedded database en- gine [9]. It provides advanced features such as ACID transactions, snapshots, and in- cremental backups. N EO EMF/M AP relies on the set of Maps provided by MapDB and uses them as a key-value store. N EO EMF/C OLUMN is built on top of Apache HBase [1] 0.98.13-hadoop2, a non- relational wide column database providing distributed data storage on top of HDFS. It 6 www.neoemf.com is able to host very large tables–billions of rows containing millions of columns–atop clusters of commodity hardware. Model distribution is hidden from client applications, which accesses the elements transparently using the standard EMF API. 6 Conclusion In this article we have presented NeoEMF, a multi-datastore model persistence frame- work. It relies on a lazy-loading capability allowing very large model navigation in a reduced amount of memory, by loading elements when they are accessed. NeoEMF pro- vides three implementations that can be plugged transparently to provide an optimized solution to different modeling use cases: atomic accesses through interactive editing, complex query computation, and cloud-based model transformation. References 1. Apache. Apache HBase, 2016. URL: https://hbase.apache.org/. 2. Amine Benelallam, Abel Gómez, Gerson Sunyé, Massimo Tisi, and David Launay. Neo4EMF, a Scalable Persistence Layer for EMF Models. In Proc. of the 10th ECMFA, pages 230–241, York, United Kingdom, 2014. 3. Amine Benelallam, Abel Gómez, Massimo Tisi, and Jordi Cabot. Distributed Model-to- Model Transformation with ATL on MapReduce. In Proc. of the 8th SLE Conference, pages 37–48. ACM, 2015. 4. Gwendal Daniel, Gerson Sunyé, and Jordi Cabot. Mogwaı̈: a Framework to Handle Complex Queries on Large Models. In Proc of the 10th RCIS Conference (to appear). IEEE, 2016. Available Online at http://tinyurl.com/jgopmvk. 5. Eclipse Foundation. The CDO Model Repository (CDO), 2016. URL: http://www. eclipse.org/cdo/. 6. Abel Gómez, Amine Benelallam, and Massimo Tisi. Decentralized Model Persistence for Distributed Computing. In Proc. of the 3rd BigMDE Workshop, pages 42–51. CEUR- WS.org, 2015. 7. Abel Gómez, Gerson Sunyé, Massimo Tisi, and Jordi Cabot. Map-based Transparent Persis- tence for Very Large Models. In Proc. of the 18th FASE Conference, pages 19–34. Springer, 2015. 8. Dimitrios S. Kolovos, Louis M. Rose, Richard F. Paige, Esther Guerra, Jesús Sánchez Cuadrado, Juan de Lara, István Ráth, Dániel Varró, Gerson Sunyé, and Massimo Tisi. MONDO: Scalable Modelling and Model Management on the Cloud. In Proc. of the Projects Showcase, (STAF 2015), pages 44–53, 2015. 9. MapDB. MapDB, 2016. URL: www.mapdb.org. 10. Parastoo Mohagheghi, Miguel A Fernandez, Juan A Martell, Mathias Fritzsche, and Wasif Gilani. MDE Adoption in Industry: Challenges and Success Criteria. In Models in Software Engineering, pages 54–59. Springer, 2009. 11. Javier Espinazo Pagán and Jesús Garcı́a Molina. Querying Large Models Efficiently. IST, 2014. 12. Tinkerpop. Blueprints API, 2016. URL: blueprints.tinkerpop.com. 13. JB Warmer and AG Kleppe. Building a Flexible Software Factory using Partial Domain Specific Models. In Proc. of the 6th OOPSLA DSM Workshop. University of Jyvaskyla, 2006. 14. Ran Wei and Dimitrios S Kolovos. An Efficient Computation Strategy for allInstances(). In Proc. of the 3rd BigMDE Workshop, pages 32–42. CEUR-WS.org, 2015.