=Paper=
{{Paper
|id=Vol-1206/paper4
|storemode=property
|title=Improving Memory Efficiency for Processing Large-Scale Models
|pdfUrl=https://ceur-ws.org/Vol-1206/paper_6.pdf
|volume=Vol-1206
|dblpUrl=https://dblp.org/rec/conf/staf/DanielSBT14
}}
==Improving Memory Efficiency for Processing Large-Scale Models==
<pdf width="1500px">https://ceur-ws.org/Vol-1206/paper_6.pdf</pdf>
<pre>
Improving Memory Efficiency for Processing Large-Scale Models

Gwendal Daniel      AtlanMod team (Inria, Mines Nantes, LINA)    gwendal.daniel@etu.univ-nantes.fr
Gerson Sunyé        AtlanMod team (Inria, Mines Nantes, LINA)    gerson.sunye@inria.fr
Amine Benelallam    AtlanMod team (Inria, Mines Nantes, LINA)    amine.benelallam@inria.fr
Massimo Tisi        AtlanMod team (Inria, Mines Nantes, LINA)    massimo.tisi@inria.fr

ABSTRACT
Scalability is a main obstacle for applying Model-Driven Engineering to reverse engineering,
or to any other activity manipulating large models. Existing solutions to persist and query
large models are currently inefficient and strongly linked to memory availability. In this
paper, we propose a memory unload strategy for Neo4EMF, a persistence layer built on top of
the Eclipse Modeling Framework and based on a Neo4j database backend. Our solution allows us
to partially unload a model during the execution of a query by using a periodic dirty saving
mechanism and transparent reloading. Our experiments show that this approach makes it
possible to query large models in a restricted amount of memory with acceptable performance.

Categories and Subject Descriptors
D.2.2 [Software Engineering]: Design Tools and Techniques

General Terms
Performance, Algorithms

Keywords
Scalability, Large models, Memory footprint

BigMDE ’14 July 24, 2014. York, UK.
Copyright © 2014 for the individual papers by the papers’ authors. Copying permitted for
private and academic purposes. This volume is published and copyrighted by its editors.

1.    INTRODUCTION
The Eclipse Modeling Framework (EMF) is the de facto standard for the Model Driven
Engineering (MDE) community. This framework provides a common base for multiple purposes
and associated tools: code generation [4, 12], model transformation [9, 13], and reverse
engineering [17, 6, 5].

These tools handle complex and large-scale models when manipulating important applications,
for example, during reverse engineering or software modernization through model
transformation. EMF was first designed to support modeling tools and has shown limitations
in handling large models. A more efficient persistence solution is needed to allow for
partial model loading and unloading, which are key points when dealing with large models.

While several solutions to persist EMF models exist, most of them do not allow partial model
unloading and cannot handle models that exceed the available memory. Furthermore, these
solutions do not take advantage of the graph nature of the models: most of them rely on
relational databases, which are not fully adapted to store and query graphs.

Neo4EMF [3] is a persistence layer for EMF that relies on a graph database and implements an
unloading mechanism. In this paper, we present a strategy to optimize the memory footprint
of Neo4EMF. To evaluate this strategy, we perform a set of queries on Neo4EMF and compare
them against two other persistence mechanisms, XMI and CDO. We measure performance in terms
of memory consumption and execution time.

The paper is organized as follows: Section 2 presents the background and the motivations for
our unloading strategy. Section 3 describes our strategy and its main concepts: dirty saving,
unloading, and extended on-demand loading. Section 4 evaluates the performance of our
persistence layer. Section 5 compares our approach with existing solutions and finally,
Section 6 concludes and draws the future perspectives of the tool.

2.    BACKGROUND
2.1   EMF Persistence
Like many other modeling tools, EMF has adopted XMI as its default serialization format. This
XML-based representation has the advantage of being human readable, but has two drawbacks:
(i) XMI sacrifices compactness for an understandable output and (ii) XMI files have to be
entirely parsed to get a readable and navigable model. The former drawback reduces
efficiency of I/O access, while the latter
increases the memory needed to load a model and limits on-demand loading and proxy uses
between files. XMI does not provide advanced features such as model versioning or concurrent
modifications.

The CDO [8] model repository was built to solve those problems. It was designed as a
framework to manage large models in a collaborative environment with a small memory
footprint. CDO relies on a client-server architecture supporting transactional accesses and
notifications. CDO servers are built on top of several persistence solutions, but in practice
only relational databases are used to store CDO objects.

2.2   Graph Databases
Graph databases are one of the NoSQL data models that have emerged to overcome the
limitations of relational databases with respect to scale and distribution. NoSQL databases
often do not ensure ACID properties, but in return, they are able to handle large-scale data
efficiently in a distributed environment.

Graph databases are based on nodes, edges, and properties. This particular data
representation fits EMF models exactly, since they are intrinsically graphs (each object can
be seen as a node and references as edges). Thus, graph databases can store EMF models
without a complex serialization process.

3.    NEO4EMF
Neo4EMF is a persistence layer built on top of the EMF framework that aims at handling large
models in a scalable way. It provides a compatible EMF API and a graph-database persistence
backend based on Neo4j [16].
Neo4EMF is open source and distributed under the terms of the (A)GPLv3 [1].

In previous work [3], we introduced the basic concepts of Neo4EMF: model change tracking and
on-demand loading. Model change tracking is based on a global changelog that stores the
modifications done on a model during an execution (from creation to save). Tracking the
modifications is done using EMF notification facilities: the changelog acts as a listener for
all the objects and creates its entries from the received notifications. Neo4EMF uses an
on-demand loading mechanism to load object fields only when they are accessed. Technically,
each Neo4EMF object is instantiated as an empty container. When one of its fields
(EReferences and EAttributes) is accessed, the associated content is loaded. This mechanism
presents two advantages: (i) the entire model does not have to be loaded at once and (ii)
unused elements are not loaded.

Neo4EMF does not use the EStore mechanism. Indeed, EStore allows the EObject data storage to
be changed by providing a stateless object that translates model modifications and accesses
into backend calls. Every generated accessor and modifier delegates to the reflective API. As
a consequence, EObjects have to fetch through the store each time a field is requested, which
triggers several database queries. In contrast, Neo4EMF is based on regular EObjects (with
in-memory fields) which are synchronized with a database backend.

In this paper we focus on the Neo4EMF memory footprint. We introduce a strategy to unload
some parts of a processed model and save memory during a query execution. In the previous
implementation, the on-demand loading mechanism allows us to load only the parts of the model
that are needed, but there is no solution to remove unneeded objects from memory, especially
when they were changed but not saved yet.

A reliable unload strategy needs to address two main issues:

   • Accessibility: Contents of unloaded objects (attributes and referenced objects) have to
     remain accessible through standard EMF accessors.

   • Transparency: The management of the object life cycle has to be independent from users,
     but customizable to fit specific needs, e.g., size of the Java virtual machine,
     requirements on execution time, etc.

Our strategy addresses these issues with a dirty-saving mechanism, which provides temporary
and transparent model persistence. The object life cycle has also been modified to include
unloading of persisted elements.

In the next sections, we provide an overview of the changelog used to record the
modifications of the processed model. Then, we present dirty saving, based on the basic
Neo4EMF save mechanism, and we describe the Neo4EMF object life cycle. Finally, we describe
the modifications done on the on-demand loading feature to handle this new strategy.

3.1   Neo4EMF Changelog
Neo4EMF needs a mechanism to ensure synchronization between the in-memory model and its
backend representation, avoiding systematic unnecessary calls to the database.

Despite the existence in EMF of a modification tracking mechanism, the ChangeRecorder class,
we decided to develop an alternative solution that minimizes memory consumption.

Neo4EMF tracks model modifications in a changelog, a sequence of entries of five types:

Object creation: A new object has been created and attached to a Neo4EMF resource.

Object deletion: An object has been deleted or removed from a Neo4EMF resource.

Attribute modifications: Attribute setting and unsetting.

Reference addition: Assignment of a new single-valued reference or addition of a new
    referenced object in a multi-valued one.

Reference deletion: Unsetting a single-valued reference or removing a referenced object in a
    multi-valued one.

We distinguish unidirectional and bidirectional reference modifications for performance
reasons (they are not serialized the same way during the saving process).
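
As an illustration only, the five entry types listed above could be sketched as follows.
This is a minimal Python sketch under stated assumptions, not the Neo4EMF implementation
(which is written in Java on top of EMF): the names ChangeLog, record, target, feature,
new_value, referenced and bidirectional are hypothetical, and the unidirectional/bidirectional
distinction is reduced to a flag for brevity.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class Entry:
    """Base entry: the object concerned by the modification."""
    target: Any


@dataclass
class CreateObject(Entry):
    """A new object was created and attached to a resource."""


@dataclass
class DeleteObject(Entry):
    """An object was deleted or removed from a resource."""


@dataclass
class SetAttribute(Entry):
    """An attribute was set; new_value is None when it was unset."""
    feature: str
    new_value: Optional[Any] = None


@dataclass
class AddLink(Entry):
    """A reference was assigned (single-valued) or extended (multi-valued)."""
    feature: str
    referenced: Any
    # Hypothetical flag standing in for the uni/bidirectional distinction.
    bidirectional: bool = False


@dataclass
class RemoveLink(Entry):
    """A reference was unset, or a referenced object was removed from it."""
    feature: str
    referenced: Any
    bidirectional: bool = False


@dataclass
class ChangeLog:
    """Ordered sequence of entries, kept in the order modifications occur."""
    entries: List[Entry] = field(default_factory=list)

    def record(self, entry: Entry) -> None:
        self.entries.append(entry)
```

With such a structure, a bidirectional update is recorded once, for example
log.record(AddLink("p1", "owned_elements", "cl1", bidirectional=True)), and the save
process can later decide how to serialize it.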
Figure 1 summarizes our changelog model. All changelog entries are subclasses of Entry, which
defines some shared properties: the object concerned by the modification (for instance the
object containing a modified attribute or reference, or the new object in case of a
CreateObject entry) and a basic serialization method.

Attribute and reference modification entries (SetAttribute, AddLink, RemoveLink and their
subclasses) have three additional fields to track fine-grained modifications: the updated
feature (attribute or reference identifier), which corresponds to the modified field of the
concerned object, and the new and old values of the feature (if available).

This decomposition provides direct access to the information required during the
serialization process, without accessing the concerned objects. The fine-grained entry
management also decreases memory consumption. For instance, modifications on bidirectional
references correspond to a single changelog entry, while they needed two basic entries
before. Serialization of those entries is also more efficient since it reduces the number of
database accesses.

In the previous version of Neo4EMF, we used the EMF notification framework to create
changelog entries. This implementation had a major drawback: notifications were handled in a
dedicated thread, and we could not ensure that all the notifications were sent to the
changelog before its serialization. This behavior could create an inconsistency between the
in-memory model and the saved one. This is another reason we do not use the EMF
ChangeRecorder facilities, which rely on notifications.

In this new version, changelog entries are created directly in the body of the generated
methods. This solution removes synchronization issues and is also more efficient, because
all the information needed to construct the entries is available in the method body (current
object, feature identifier, new and old values). We also do not have to deal with the generic
notification API, which required many casts and complex processing to retrieve this
information. Synchronizing the changelog brings another important benefit: the causality
between model modifications and entry order is ensured and there is no need to reorder the
entry stack before its serialization.

Finally, we modify the changelog life cycle. In the previous version, the changelog was a
global singleton object, containing the record of a full execution, mixing modifications of
multiple resources. This solution is not optimal because saving is done per resource in EMF,
and to save a single resource the entire modification stack needed to be processed to
retrieve the corresponding entries. We choose to create a dedicated changelog in each Neo4EMF
resource that handles modifications only for the objects contained in the associated
resource. This modification reduces the complexity of the save processing: the resource
changelog is simply iterated and its entries are then serialized into database calls. The
synchronized aspect of the changelog allows us to process the entries in the order they are
added, which was not possible in the previous version.

Furthermore, associating a changelog with a resource ensures that, when the resource is
deleted, all the related entries are also deleted. In the previous version, entries could not
be deleted from the global changelog, and were kept in memory during the execution.

3.2   Dirty Saving
Neo4EMF relies on a mapping between EMF entities and Neo4j concepts to save its
modifications. Figure 2 shows an excerpt of the Java metamodel, used in the MoDisco [17]
project. This metamodel describes Java applications in terms of Packages, ClassDeclarations,
BodyDeclarations, and Comments. A Package is a named container that gathers a set of
ClassDeclarations through its owned_elements composition. A ClassDeclaration is composed of
a name, a set of Comments and a set of BodyDeclarations.

Figure 3 shows a simple instance of this metamodel: a Package (package1), containing one
ClassDeclaration (class1). This ClassDeclaration contains two Comments (comment1 and
comment2) and one single BodyDeclaration (body1).

[Figure 2: Excerpt of MoDisco Java Metamodel. Classes: Package (name : String),
ClassDeclaration (name : String), BodyDeclaration (name : String), Comment (content : String).
References: owned_elements (*), body_declarations (*), comments (*).]

[Figure 3: Sample instance of Java Metamodel. Objects: p1 : Package (name : "package1"),
cl1 : ClassDeclaration (name : "class1"), b1 : BodyDeclaration (name : "body1"),
com1 and com2 : Comment ("comment1", "comment2"). Links: owned_elements, body_declarations,
comments.]

Figures 2, 3, and 4 show that:

Model elements are represented as nodes. Nodes with identifier p1, cl1, b1, and com1 are
    examples corresponding to p1, cl1, b1, and com1 in Figure 3. The root node represents the
    entry point of the model (the resource directly or indirectly containing all the other
    elements) and is not associated to a model object.

Element attributes are represented as node properties. Node properties are ⟨name, value⟩
    pairs, where name is the feature identifier and value the value of the feature. Node
    properties can be observed for p1, cl1, and b1.

Metamodel elements are also represented as nodes and are indexed to facilitate their access.
    Metamodel nodes have two properties: the metaclass name and the metamodel unique
    identifier. P, Cl, B and Com are examples of metamodel element nodes; they correspond to
    Package, ClassDeclaration, BodyDeclaration, and Comment, respectively, in Figure 2.
[Figure 1: Changelog Metamodel. Elements shown: ChangeLog; Entry (process()); EObject;
Entry subclasses NewObject, DeleteObject, SetAttribute (updatedFeature : EAttribute),
AddLink (updatedFeature : EReference) and RemoveLink (updatedFeature : EReference);
BidirectionalAddLink and UnidirectionalAddLink under AddLink; BidirectionalRemoveLink and
UnidirectionalRemoveLink under RemoveLink.]

[Figure 4: Sample instance database representation. Nodes: ROOT; p1 (name : 'package1');
cl1 (name : 'class1'); b1 (name : 'body1'); com1 (content : 'comment1');
com2 (content : 'comment2'); metaclass nodes named 'Package', 'ClassDeclaration',
'BodyDeclaration' and 'Comment', each with nsURI = 'http://java'.
Relationships: IS_ROOT, PACKAGE__OWNED_ELEMENTS, CLASS__DECLARATION_BODY_DECLARATIONS,
CLASS__DECLARATION_COMMENTS, INSTANCE_OF.]

InstanceOf relationships are outgoing relationships between the element nodes and the nodes
    representing metaclasses. They represent the conformance of an object instance to its
    class definition.

References between objects are represented as relationships. To avoid naming conflicts,
    relationships are named using the following convention: class_name__reference_name
    (for example, PACKAGE__OWNED_ELEMENTS in Figure 4).

When a save is requested, changelog entries are processed to update the database backend.
Each entry is serialized into a database operation. The CreateObject entry corresponds to the
creation of a new node and its meta-information (instanceof to its meta-class, isRoot if the
object is directly contained in the resource). All the fields of the object are also
serialized and directly saved in the database. A SetAttribute entry corresponds to an update
of the related node’s property with the corresponding name. AddLink, RemoveLink, and their
subclasses respectively record the creation and removal of a relationship, storing the
containing class and feature name.

We decide to serialize a created object and all its references and attributes at the same
time. New objects need to be entirely persisted, and there is no reason to record their
modifications before their first serialization (the final state of the object is the one that
needs to be persisted). This full serialization behavior has the advantage of generating only
a single entry for a new object, independently of the number of its modified fields.

This approach works well for small models, but has issues when a large modification set needs
to be persisted: the changelog grows indefinitely until the user decides to save it. This is
typically the case in reverse engineering, where the extracted objects are first all created
in memory and only saved afterwards.

To address this problem we introduce dirty saving, a periodic save action not requested by
the user. The period is determined by the changelog size, configurable through the Neo4EMF
resource. Since these save operations are not requested by the user, they have to ensure two
properties:

   • Reversibility: if the modifications are canceled or if the user does not want to save a
     session, the database should roll back to an acceptable version. This version is either
     (i) the previous regularly saved database if an older version exists or (ii) an empty
     database.
   • Persistability: if a regular save is requested by the          then calling a dirty save, the database will be updated as in
     user, the temporary objects in the database have to            Figure 6. Note that a Delete relationship has been created
     be definitively persisted. They can then constitute a          because the removed Comment is not contained in the re-
     new acceptable version of the database if a rollback is        source anymore. Red relationships and nodes are indexed
     needed.                                                        respectively in tmp_relationships and tmp_nodes indexes.

                                                                    This example shows that our mapping is built on top of the
We introduce a new mapping for changelog entries with the           existing one: there is no modification done on the previ-
purpose of temporary dirty saving. This mapping is based            ous version, represented with black nodes. This simplifies
on the same entries as the regular mapping but the associ-          the rollback process, which consists of a deletion of all the
ated Neo4j concepts allow the system to easily extract dirty        temporary Neo4j objects.
objects and regular ones. In addition we create two indexes:
tmp_relationships and tmp_nodes which respectively con-
tain the dirty relationships and nodes (i. e., created in a dirty
                                                                    3.3    Object Life Cycle
saving session). Figure 5 summarizes the mapping between            We modify the Neo4EMF object life cycle to enable unload-
changelog entries and Neo4j concepts:                               ing. When a dirty saving is invoked, all the modifications
                                                                    contained in the changelog are committed to the database.
                                                                    Because of this persistence, persisted objects can be safely
   • CreateObject: creation of a new node (as in the reg-           released from memory and reloaded using on-demand load-
     ular saving process) and addition to the tmp_nodes             ing, if needed.
     index.
                                                                    Figure 7 shows the different life cycle states of a Neo4EMF
   • SetAttribute: creation of a dedicated node contain-            object. When a Neo4EMF object is created it is New: it
     ing the dirty attributes. The idea is to keep a stable         has not been persisted into the database and cannot be re-
     version (i. e., the previous regularly saved version) to       leased. When a save is requested or a dirty save is invoked,
     easily reverse it. A SetAttribute relationship is cre-         the new object is persisted into the database and it is tagged
     ated to link the base object and its attribute node            as Clear: all the known modifications related to the object
                                                                    have been saved and it is fetchable from the database with-
   • AddLink: creation of a generic AddLink relation-               out information loss. In this state the object can be removed
     ship, containing the reference identifier as a property.       from memory without consistency issues. When a modifica-
     This special relationship format is needed to easily pro-      tion is done on the object (setting an attribute or updating
     cess dirty relationships and retrieve their correspond-        a reference) then it is tagged as Modified.
     ing image if a regular save operation is requested
                                                                    Modified objects cannot be released, because their
   • RemoveLink: creation of a generic RemoveLink                   database-mapped nodes do not contain the modified information.
     relationship, containing the reference identifier as a         When a save is processed, the Modified objects revert to Clear
     property. AddLink and RemoveLink relationships with            state and can be released again. Objects being loaded also have
     the same reference identifier and target object are            a dedicated state that prevents their garbage collection while
     mutually exclusive to limit the number of temporary            loading is in progress.
     objects in the database.

   • DeleteObject: creation of a special Delete relation-
                                                                           Figure 7:    Neo4EMF EObject life cycle
     ship looping on the related node. The base version of
     the node is kept alive if a rollback is needed.
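
The mapping listed above can be illustrated with a minimal in-memory sketch. It does not use the Neo4j API: the class
DirtyGraphSketch, its Rel type and all method names are hypothetical stand-ins, and the two sets only play the role
of the tmp_nodes and tmp_relationships indexes. The point is that the base version is never modified, so a rollback
reduces to deleting the temporary objects.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

/** In-memory stand-in for the graph store (hypothetical; this is not the Neo4j API). */
final class DirtyGraphSketch {

    /** A relationship; 'rel' holds the reference identifier of AddLink/RemoveLink, else null. */
    static final class Rel {
        final String type; final long from; final long to; final String rel;
        Rel(String type, long from, long to, String rel) {
            this.type = type; this.from = from; this.to = to; this.rel = rel;
        }
        @Override public boolean equals(Object o) {
            if (!(o instanceof Rel)) return false;
            Rel r = (Rel) o;
            return type.equals(r.type) && from == r.from && to == r.to && Objects.equals(rel, r.rel);
        }
        @Override public int hashCode() { return Objects.hash(type, from, to, rel); }
    }

    final Map<Long, Map<String, Object>> nodes = new HashMap<>();
    final List<Rel> relationships = new ArrayList<>();
    final Set<Long> tmpNodes = new HashSet<>();        // plays the role of tmp_nodes
    final Set<Rel> tmpRelationships = new HashSet<>(); // plays the role of tmp_relationships
    private long nextId = 0;

    /** Regular save of a node, used here only to build a stable base version. */
    long regularNode(Map<String, Object> properties) {
        long id = nextId++;
        nodes.put(id, new HashMap<>(properties));
        return id;
    }

    /** Regular save of a relationship whose type is the reference identifier. */
    void regularLink(long from, String reference, long to) {
        relationships.add(new Rel(reference, from, to, null));
    }

    /** CreateObject: a new node that is also added to tmp_nodes. */
    long createObject(Map<String, Object> properties) {
        long id = regularNode(properties);
        tmpNodes.add(id);
        return id;
    }

    /** SetAttribute: the dirty value goes to a dedicated node; the base node is untouched. */
    void setAttribute(long owner, String name, Object value) {
        long attributeNode = createObject(Map.of(name, value));
        addTemporary(new Rel("SetAttribute", owner, attributeNode, null));
    }

    /** AddLink and RemoveLink on the same reference and target cancel each other. */
    void addLink(long from, String reference, long to) {
        addOrCancel("AddLink", "RemoveLink", from, reference, to);
    }

    void removeLink(long from, String reference, long to) {
        addOrCancel("RemoveLink", "AddLink", from, reference, to);
    }

    /** DeleteObject: a Delete relationship looping on the node, which is kept alive. */
    void deleteObject(long id) {
        addTemporary(new Rel("Delete", id, id, null));
    }

    /** Rollback: delete every temporary object; the base version was never modified. */
    void rollback() {
        relationships.removeAll(tmpRelationships);
        nodes.keySet().removeAll(tmpNodes);
        tmpRelationships.clear();
        tmpNodes.clear();
    }

    private void addOrCancel(String type, String opposite, long from, String reference, long to) {
        Rel pendingOpposite = new Rel(opposite, from, to, reference);
        if (tmpRelationships.remove(pendingOpposite)) {
            relationships.remove(pendingOpposite);
            return;
        }
        addTemporary(new Rel(type, from, to, reference));
    }

    private void addTemporary(Rel relationship) {
        relationships.add(relationship);
        tmpRelationships.add(relationship);
    }
}
```

Keeping the temporary objects in separate collections is what makes the rollback a plain bulk deletion in this sketch;
a real store would rely on its own index facilities instead.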


The objective of this mapping is to preserve all the information
contained after a regular save, to easily handle a rollback.
That is why object deletion is done using a relationship: if the
modifications are aborted it is simpler to remove the
relationship than to create a new instance of the node with
backup information. We do not use a property to tag deleted
objects for performance reasons (access to node properties is
slower than edge navigation).

To turn dirty objects in the database into regularly saved ones,
a serialization process is invoked. Like changelog entries, each
temporary Neo4j element contains all the information needed to
create its regular equivalent: new objects are simply removed
from the tmp_nodes index, AddLink relationships are turned into
their regular version using their properties, and RemoveLink
entries correspond to the deletion of their
existing regular version.                                           To allow garbage collection of Neo4EMF objects, we use
                                                                    Java Soft and Weak references to store object’s fields. Weak
For example if we update the model given in Figure 3 by re-         and Soft referenced objects are eligible for garbage collection
moving com1 and creating a new BodyDeclaration body2                as soon as there is no strong reference chain on them. The
Figure 5: Changelog to Neo4j entity mapping
[Class diagram: ChangeLog, Entry (1..*) and EObject; Entry kinds NewObject, SetAttribute, AddLink, RemoveLink and
DeleteObject; Neo4j::Node; Neo4j::RelationshipType instances named "AddLink" (+ relName : String), "RemoveLink"
(+ relName : String), "SetAttribute" and "Delete".]


Figure 6: Database state after modifications
[Graph: nodes ROOT, p1 (name : 'package1'), cl1 (name : 'class1'), b1 (name : 'body1'), com1 (content : 'comment1'),
com2 (content : 'comment2') and b2 (name : 'body2'). Relationship labels: IS_ROOT, PACKAGE__OWNED_ELEMENTS,
CLASS__DECLARATION_BODY_DECLARATIONS, CLASS__DECLARATION_COMMENTS (twice), Delete,
RemoveLink (rel='CLASS__DECLARATION_COMMENTS') and AddLink (rel='CLASS__DECLARATION_BODY_DECLARATIONS').]


difference between the two kinds of references is the time they can remain in memory. Weak references are collected
as soon as possible by the garbage collector, whereas Soft references can be retained in memory as long as the
garbage collector does not need to free them (i.e., as long as there is enough available memory). This particular
behavior is interesting for cache implementation and to optimize execution speed when a large amount of memory is
available. The reference type (Weak or Soft) can be set through Neo4EMF resource parameters.

As described in Section 3.1, changelog entries contain all the information related to the serialization of the
concerned object. This information constitutes the strong reference chain on the related object fields. When a save
is done, entries are processed and deleted, breaking the strong reference chain and making objects eligible for
garbage collection.

Neo4j's objects are not impacted by this new life cycle. The database manages its objects' life cycle through a
policy defined at the resource creation (memory or performance preferences).

3.4    Extended On-Demand Loading
To handle the new architecture of our layer, we have to extend the on-demand loading feature to support temporary
persisted objects. On-demand loading uses two parameters: (i) the object that handles the feature to load and
(ii) the identifier of the feature to load. This behavior implies that a Neo4EMF object is always loaded from
another Neo4EMF object.

Figure 6 shows our Java metamodel instance state after a dirty save. The database content is a mix between
regularly saved objects (in black) and dirty-saved ones (in red). Loading referenced Comments instances from
ClassDeclaration cl1 is done in three steps to ensure the last dirty-saved
operations have been considered.                                   Persistence Layer    XMI           CDO          Neo4EMF
First, class declaration comments relationships are                #Created Elements    22 939 780    4 378 990    >40 000 000¹
processed and their end nodes are saved. Second, the AddLink
relationships containing the corresponding rel property are        Table 1: Number of Created Elements Before Memory Overhead
processed and their end nodes are added to the previous ones.
This operation retrieves all the associated nodes for the given
feature, regular ones and dirty ones. Third, RemoveLink
relationships are processed the same way and their end nodes
are removed from the loaded node set.
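
The three loading steps can be sketched as a small set computation. This is only an illustration under stated
assumptions: the class DirtyAwareLoadingSketch, its Rel type and the method loadReference are hypothetical names,
relationships are plain Java objects instead of database entities, and the example data follows the labels of
Figure 6.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class DirtyAwareLoadingSketch {

    /** Outgoing relationship of the owner node; 'rel' is set only on AddLink/RemoveLink. */
    static final class Rel {
        final String type; final String rel; final String endNode;
        Rel(String type, String rel, String endNode) {
            this.type = type; this.rel = rel; this.endNode = endNode;
        }
    }

    /** Returns the identifiers of the nodes currently referenced through 'feature'. */
    static Set<String> loadReference(List<Rel> outgoing, String feature) {
        Set<String> loaded = new LinkedHashSet<>();
        for (Rel r : outgoing) {                      // step 1: regularly saved relationships
            if (r.type.equals(feature)) loaded.add(r.endNode);
        }
        for (Rel r : outgoing) {                      // step 2: dirty additions
            if (r.type.equals("AddLink") && feature.equals(r.rel)) loaded.add(r.endNode);
        }
        for (Rel r : outgoing) {                      // step 3: dirty removals
            if (r.type.equals("RemoveLink") && feature.equals(r.rel)) loaded.remove(r.endNode);
        }
        return loaded;
    }

    public static void main(String[] args) {
        // Outgoing relationships of cl1, transcribed from the example of Figure 6.
        List<Rel> cl1 = List.of(
            new Rel("CLASS__DECLARATION_BODY_DECLARATIONS", null, "b1"),
            new Rel("CLASS__DECLARATION_COMMENTS", null, "com1"),
            new Rel("CLASS__DECLARATION_COMMENTS", null, "com2"),
            new Rel("RemoveLink", "CLASS__DECLARATION_COMMENTS", "com1"),
            new Rel("AddLink", "CLASS__DECLARATION_BODY_DECLARATIONS", "b2"));
        System.out.println(loadReference(cl1, "CLASS__DECLARATION_COMMENTS"));          // [com2]
        System.out.println(loadReference(cl1, "CLASS__DECLARATION_BODY_DECLARATIONS")); // [b1, b2]
    }
}
```

Applying the removals last is what guarantees that the most recent dirty-saved state wins over the regular one.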

Attribute fetching behavior is a bit different: if a node repre-
senting an object has relationships to a dedicated attribute
node, then the data contained in this node is returned in-
stead of the base node property.
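
A minimal sketch of this attribute lookup is given below. The names DirtyAttributeFetchSketch and fetch, as well as
the "abstract" attribute, are hypothetical. Falling back to the base node when the dedicated node does not hold the
requested attribute is an assumption of this sketch, not something stated above.

```java
import java.util.Map;

final class DirtyAttributeFetchSketch {

    /**
     * Returns the dirty value when a dedicated attribute node exists and holds the attribute,
     * otherwise the property stored on the base node (assumed fallback).
     */
    static Object fetch(Map<String, Object> baseNode, Map<String, Object> attributeNode, String name) {
        if (attributeNode != null && attributeNode.containsKey(name)) {
            return attributeNode.get(name);
        }
        return baseNode.get(name);
    }

    public static void main(String[] args) {
        Map<String, Object> base = Map.of("name", "class1", "abstract", false);
        Map<String, Object> dirty = Map.of("name", "renamed");
        System.out.println(fetch(base, dirty, "name"));     // renamed
        System.out.println(fetch(base, dirty, "abstract")); // false
        System.out.println(fetch(base, null, "name"));      // class1
    }
}
```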

To improve the performance of our layer, we create a cache
that maps Neo4j identifiers to their associated object. When
on-demand loading is performed, the cache is checked first,
avoiding the cost of a database access. This cache is also
used to retrieve released objects.
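
Such a cache can be sketched with the standard java.lang.ref classes. The class NodeCacheSketch, its databaseLoader
callback and the databaseAccesses counter are hypothetical, and holding cache values through Soft or Weak references
is an assumption of this sketch: it lets a released object be found again as long as the garbage collector has not
reclaimed it, and falls back to the database otherwise.

```java
import java.lang.ref.Reference;
import java.lang.ref.SoftReference;
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongFunction;

final class NodeCacheSketch<T> {
    private final Map<Long, Reference<T>> cache = new HashMap<>();
    private final boolean soft;                      // true: SoftReference, false: WeakReference
    private final LongFunction<T> databaseLoader;    // hypothetical on-demand loading callback
    int databaseAccesses = 0;

    NodeCacheSketch(boolean soft, LongFunction<T> databaseLoader) {
        this.soft = soft;
        this.databaseLoader = databaseLoader;
    }

    /** Checks the cache first and only accesses the database on a miss. */
    T get(long nodeId) {
        Reference<T> reference = cache.get(nodeId);
        T object = (reference == null) ? null : reference.get();
        if (object == null) {                        // never loaded, or already collected
            object = databaseLoader.apply(nodeId);
            databaseAccesses++;
            Reference<T> fresh = soft ? new SoftReference<T>(object) : new WeakReference<T>(object);
            cache.put(nodeId, fresh);
        }
        return object;
    }
}
```

For instance, with `new NodeCacheSketch<>(true, id -> new StringBuilder("node" + id))`, two successive calls to
`get(42L)` return the same instance while the caller still holds it, and the loader is invoked only once.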

4.    EVALUATION
In this section, we evaluate how the memory footprint and the access time of Neo4EMF scale in different large model
scenarios, and we compare it against CDO and XMI. These experiments are performed over two EMF models extracted
with the MoDisco Java Discoverer [17]. Both models are extracted from Eclipse plug-ins: the first one is an internal
tool and the second one is the Eclipse JDT plugin. The resulting XMI files are 20 MB and 420 MB, containing
respectively around 80 000 and 1 700 000 elements.

4.1    Execution Environment
Experiments are executed on a computer running Windows 7 Professional 64-bit. The relevant hardware is an Intel
Core i5 3350P processor (3.5 GHz), 8 GB of DDR3 SDRAM (1600 MHz) and a Seagate Barracuda 7200.14 hard disk
(6 Gb/s). Experiments are executed on Eclipse 4.3 running Java SE Runtime Environment 1.8.

To compare the three persistence solutions, we generate three different EMF models from the MoDisco Java Metamodel:
(i) the standard EMF model, (ii) the CDO one and (iii) the Neo4EMF one. We import both models from XMI to CDO and
Neo4EMF and we verify they contain the same data after the import.

Neo4EMF uses an embedded Neo4j database to store its objects. To provide a meaningful comparison in terms of memory
consumption we choose to use an embedded CDO server.

Experiment 1: Object creation. In this first experiment, we execute an infinite loop of object creation and simply
count how many objects have been created before an OutOfMemoryError is thrown. We choose a simple tree structure of
three classes to instantiate from the MoDisco Java metamodel: a parent ClassFile containing 1000 BlockComment and
ImportDeclaration elements. The resulting model is a set of independent element trees. For this experiment we
choose a 1 GB Java virtual machine and an arbitrarily fixed changelog size of 100 000 entries. Table 1 summarizes
the results.

Note that the number given for Neo4EMF is an approximation: we stop the execution before any OutOfMemoryError. The
average memory used to create elements was around 500 MB and did not appear to grow. This performance is due to the
dirty-saving mechanism: created objects generate entries in the changelog. When the changelog is full, changes are
saved temporarily in the database, freeing the changelog for the next object creations.

Figure 8: Memory Consumption: Model Traversal and Save (20 MB)

Experiment 2: Model traversal. In this experiment, we load a model and execute a traversal query that starts from
the root of the model, traverses all the containment tree and modifies the name attribute of all NamedElements. All
the modifications are saved at the end of the execution. During the traversal, we measure the execution time for
covering the entire model and the average memory used to perform the query. In addition, we measure the memory
needed to save the modifications at the end of the execution. Figures 8 and 9 summarize memory results. As
expected, the Neo4EMF traversal footprint is higher than the XMI one because we include the Neo4j embedded database
and runtime in our measurements. The benefit of unloading is clear when comparing the results with CDO: by removing
unused (i.e., unreferenced) objects we save space and process the request in a reduced amount of memory. For this
experiment we use a 4 GB Java virtual machine, with the ConcMarkSweepGC garbage collector, recommended when using
Neo4j.

Experiment 3: Time performance. This experiment is similar to the previous one, but we focus on time performance.
We measure the time needed to perform traversal and save. Figures 10 and 11 summarize the results. To provide a
fair comparison between full and on-demand loading strategies we also include model loading time with the traversal
queries.

¹ The execution was stopped before any memory exception.
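
The dirty-saving behavior described for Experiment 1, combined with the object states of Section 3.3, can be
sketched as a bounded changelog. This is a simplified illustration and not the Neo4EMF implementation: the class
BoundedChangelogSketch, its ModelObject type and the capacity of 3 used in the example are hypothetical, and the
dirty save is reduced to a state change instead of a database write.

```java
import java.util.ArrayList;
import java.util.List;

final class BoundedChangelogSketch {

    /** Object states named after the life cycle of Section 3.3. */
    enum State { NEW, CLEAR, MODIFIED, LOADING }

    static final class ModelObject {
        State state = State.NEW;
        /** Only Clear objects can be released from memory without information loss. */
        boolean releasable() { return state == State.CLEAR; }
    }

    private final int capacity;
    private final List<ModelObject> entries = new ArrayList<>(); // strong references pin the objects
    int dirtySaves = 0;

    BoundedChangelogSketch(int capacity) { this.capacity = capacity; }

    /** Records a creation: the object is New until the next (dirty) save. */
    ModelObject create() {
        ModelObject object = new ModelObject();
        record(object);
        return object;
    }

    /** Records a modification: a Clear object becomes Modified until the next save. */
    void modify(ModelObject object) {
        if (object.state == State.CLEAR) object.state = State.MODIFIED;
        record(object);
    }

    /** Stands in for the temporary save: entries are processed, then dropped. */
    void dirtySave() {
        for (ModelObject object : entries) object.state = State.CLEAR;
        entries.clear();
        dirtySaves++;
    }

    int pendingEntries() { return entries.size(); }

    private void record(ModelObject object) {
        entries.add(object);
        if (entries.size() >= capacity) dirtySave(); // changelog full: trigger a dirty save
    }

    public static void main(String[] args) {
        BoundedChangelogSketch log = new BoundedChangelogSketch(3);
        ModelObject a = log.create();
        log.create();
        System.out.println(a.state + " " + a.releasable());       // NEW false
        log.create();                                              // third entry fills the changelog
        System.out.println(a.state + " " + a.releasable());       // CLEAR true
        log.modify(a);
        System.out.println(a.state + " " + log.pendingEntries()); // MODIFIED 1
    }
}
```

Because the changelog never holds more than a fixed number of entries in this sketch, the number of pinned objects
stays bounded, which is consistent with the flat memory usage reported above.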
Figure 9: Memory Consumption: Model Traversal and Save (420 MB)

Figure 10: 20 MB model traversal and save performances

Figure 11: 420 MB traversal and save performances

Neo4EMF save performance can be explained by dirty-saving: during the traversal, entries are generated to track the
name modifications. These entries are then saved in the database when the changelog is full, reducing the final
save cost. This behavior also explains a part of the traversal time overhead, when compared to CDO: Neo4EMF
traversal implies database write access for dirty saving where CDO does not, and the related I/O accesses
considerably impact performance.

4.2   Discussion
The results of these experiments show that dirty-saving coupled with on-demand loading significantly decreases the
memory needed to execute a query. As expected, this memory footprint improvement worsens the time performance of
our tool, in particular because of dirty-saving, which generates several database calls. That is why we provide
dirty saving configuration through the Neo4EMF resource. The experiments also show that Neo4EMF is able to handle
large queries and modifications in a limited amount of memory, compared to existing solutions.

We also run our benchmarks on different operating systems (Ubuntu 12.04 and 13.10) and we find that CDO and Neo4EMF
time performance seems to be linked to the file partition format (especially in I/O accesses): Neo4j is about 1.5
times faster on these operating systems, while CDO is slower by approximately the same factor. More investigation
is needed to optimize our tool in different contexts.

Our experiments show that Neo4EMF is an interesting alternative to CDO to handle large models in memory-constrained
environments. On-demand loading and transparent unloading offer a small memory footprint (smaller than CDO in our
experiments), but our solution does not provide advanced features like collaborative editing and versioning
provided by CDO.

The unload strategy is transparent for the user, but may be intrusive in some cases, for instance if disk space is
limited or time performance is critical. This is why we introduce configuration for dirty saving and changelog size
through the Neo4EMF resource.

5.   RELATED WORK
Models obtained by reverse engineering with EMF-based tools such as MoDisco [17, 5, 11] can be composed of millions
of elements. Existing solutions to handle this kind of models have shown clear limitations in terms of memory
consumption and processing.

CDO is the de facto standard to handle large models using a server and a relational database. However, some
experiments have shown that CDO does not scale well to very large models [2]. Pagán et al. [14, 15] propose to use
NoSQL databases to store models, especially because those kinds of databases should fit better to the
interconnected nature of EMF models.

Mongo EMF [7] is a NoSQL approach that stores EMF models in MongoDB, a document-oriented database. However, Mongo
EMF storage is different from the standard EMF persistence backend, and cannot be used as is to replace another
persistence solution in an existing system. Modifications on the client software are needed to integrate it.
Morsa [14] is another persistence solution based on the MongoDB database. Similarly to Neo4EMF, Morsa uses a
standard EMF mechanism to ensure persistence, but it uses a client-server architecture, like CDO. Morsa has some
similarities with Neo4EMF, notably in its on-demand loading mechanism, but does not use a graph database.

EMF Fragments [10] is another EMF persistence layer based on a NoSQL database. The EMF Fragments approach is
different from other NoSQL persistence solutions: it relies on the proxy mechanism provided by EMF. Models are
automatically partitioned and loading is performed by partition. Loading on demand is only performed for
cross-partition references. Another difference with Neo4EMF is that EMF Fragments needs to annotate the metamodels
to provide the partition set, whereas our approach does not require model adaptation or tool modification.

6.   CONCLUSION AND FUTURE WORK
In this paper, we presented a strategy to optimize the memory footprint of Neo4EMF, a persistence layer designed to
handle large models through on-demand loading and transparent unloading. Our experiments show that Neo4EMF is an
interesting alternative to CDO for accessing and querying large models, especially when little memory is available,
with a tolerable performance loss. Neo4EMF does not have collaborative model editing or model versioning features,
which biases our results: providing those features may imply higher memory consumption.

In future work, we plan to improve our layer by providing partial collection loading, allowing subparts of large
collections to be loaded from the database. In our experiments, we detected some memory consumption overhead in
this particular case: when an object contains a huge number of referenced objects (through the same reference) and
they are all loaded at once.

We then plan to study the inclusion of attribute and reference meta-information directly in the database to avoid
unnecessary object loading: some EMF mechanisms, like isSet, may induce load on demand of the associated attribute,
just in order to make a comparison. It could be interesting to provide this information from the database without
a complete and costly object loading.

Finally, we want to introduce loading strategies such as prefetching or model partitioning (using optional
metamodel annotations or a definition of the model usage) to allow users to customize the object life cycle.


7.   REFERENCES
 [1] AtlanMod. Neo4EMF, 2014. url: http://www.neo4emf.com/.
 [2] K. Barmpis and D. S. Kolovos. Comparative analysis of data persistence technologies for large-scale models.
     In Proceedings of the 2012 Extreme Modeling Workshop, XM '12, pages 33–38, New York, NY, USA, 2012. ACM.
 [3] A. Benelallam, A. Gómez, G. Sunyé, M. Tisi, and D. Launay. Neo4EMF, a scalable persistence layer for EMF
     models. July 2014.
 [4] L. Bettini. Implementing Domain-Specific Languages with Xtext and Xtend. 2013.
 [5] H. Bruneliere, J. Cabot, G. Dupé, and F. Madiot. MoDisco: A model driven reverse engineering framework.
     Information and Software Technology, 56(8):1012–1032, 2014.
 [6] H. Bruneliere, J. Cabot, F. Jouault, and F. Madiot. MoDisco: A generic and extensible framework for model
     driven reverse engineering. In Proceedings of the IEEE/ACM International Conference on Automated Software
     Engineering, ASE '10, pages 173–174, New York, NY, USA, 2010. ACM.
 [7] Bryan Hunt. MongoEMF, 2014. url: https://github.com/BryanHunt/mongo-emf/wiki/.
 [8] Eclipse Foundation. The CDO Model Repository (CDO), 2014. url: http://www.eclipse.org/cdo/.
 [9] INRIA and LINA. ATLAS transformation language, 2014.
[10] Markus Scheidgen. EMF fragments, 2014. url: https://github.com/markus1978/emf-fragments/wiki/.
[11] Modeliosoft Solutions, 2014. url: http://www.modeliosoft.com/.
[12] J. Musset, É. Juliot, S. Lacrampe, W. Piers, C. Brun, L. Goubet, Y. Lussaud, and F. Allilaire. Acceleo user
     guide, 2006.
[13] OMG. MOF 2.0 QVT final adopted specification (ptc/05-11-01), April 2008.
[14] J. E. Pagán, J. S. Cuadrado, and J. G. Molina. Morsa: A scalable approach for persisting and accessing large
     models. In Proceedings of the 14th International Conference on Model Driven Engineering Languages and
     Systems, MODELS'11, pages 77–92, Berlin, Heidelberg, 2011. Springer-Verlag.
[15] J. E. Pagán and J. G. Molina. Querying large models efficiently. Information and Software Technology, 2014.
     In press, accepted manuscript. url: http://dx.doi.org/10.1016/j.infsof.2014.01.005.
[16] J. Partner, A. Vukotic, and N. Watt. Neo4j in Action. O'Reilly Media, 2013.
[17] The Eclipse Foundation. MoDisco Eclipse Project, 2014. url: http://www.eclipse.org/MoDisco/.

</pre>