<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Improving Memory Efficiency for Processing Large-Scale Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gwendal Daniel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gerson Sunyé</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amine Benelallam</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Massimo Tisi</string-name>
          <email>massimo.tisi@inria.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>AtlanMod team</institution>
          ,
          <addr-line>Inria, Mines Nantes, LINA</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Scalability is a major obstacle to applying Model-Driven Engineering to reverse engineering, or to any other activity that manipulates large models. Existing solutions to persist and query large models are currently inefficient and strongly tied to memory availability. In this paper, we propose a memory unload strategy for Neo4EMF, a persistence layer built on top of the Eclipse Modeling Framework and based on a Neo4j database backend. Our solution allows us to partially unload a model during the execution of a query by using a periodic dirty-saving mechanism and transparent reloading. Our experiments show that this approach makes it possible to query large models in a restricted amount of memory with acceptable performance.</p>
      </abstract>
      <kwd-group>
        <kwd>Scalability</kwd>
        <kwd>Large models</kwd>
        <kwd>Memory footprint</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Categories and Subject Descriptors</title>
      <p>D.2.2 [Software Engineering]: Design Tools and Techniques</p>
      <p>General Terms: Performance, Algorithms</p>
    </sec>
    <sec id="sec-2">
      <title>1. INTRODUCTION</title>
      <p>
        The Eclipse Modeling Framework (EMF) is the de facto
standard for the Model Driven Engineering (MDE)
community. This framework provides a common base for
multiple purposes and associated tools: code generation [
        <xref ref-type="bibr" rid="ref12 ref4">4, 12</xref>
        ],
model transformation [
        <xref ref-type="bibr" rid="ref13 ref9">9, 13</xref>
        ], and reverse engineering [
        <xref ref-type="bibr" rid="ref17 ref5 ref6">17, 6,
5</xref>
        ].
      </p>
      <p>BigMDE ’14 July 24, 2014. York, UK.</p>
      <p>Copyright © 2014 for the individual papers by the papers’ authors.
Copying permitted for private and academic purposes. This volume is published
and copyrighted by its editors.</p>
      <p>These tools handle complex and large-scale models when
manipulating large applications, for example, during
reverse engineering or software modernization through model
transformation. EMF was first designed to support
modeling tools and has shown limitations in handling large models.
A more efficient persistence solution is needed to allow for
partial model loading and unloading, which are key points
when dealing with large models.</p>
      <p>
        While several solutions to persist EMF models exist, most of
them do not allow partial model unloading and cannot
handle models that exceed the available memory. Furthermore,
these solutions do not take advantage of the graph nature
of the models: most of them rely on relational databases,
which are not fully adapted to store and query graphs.
Neo4EMF [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is a persistence layer for EMF that relies on
a graph database and implements an unloading mechanism.
In this paper, we present a strategy to optimize the
memory footprint of Neo4EMF. To evaluate this strategy, we
perform a set of queries on Neo4EMF and compare them
against two other persistence mechanisms, XMI and CDO.
We measure performance in terms of memory consumption
and execution time.
      </p>
      <p>The paper is organized as follows: Section 2 presents the
background and the motivations for our unloading strategy.
Section 3 describes our strategy and its main concepts: dirty
saving, unloading, and extended on-demand loading.
Section 4 evaluates the performance of our persistence layer.
Section 5 compares our approach with existing solutions and,
finally, Section 6 concludes and outlines future
perspectives for the tool.</p>
    </sec>
    <sec id="sec-3">
      <title>2. BACKGROUND</title>
    </sec>
    <sec id="sec-4">
      <title>2.1 EMF Persistence</title>
      <p>Like many other modeling tools, EMF has adopted XMI as
its default serialization format. This XML-based
representation has the advantage of being human-readable, but it has
two drawbacks: (i) XMI sacrifices compactness for an
understandable output and (ii) XMI files have to be entirely
parsed to obtain a readable and navigable model. The former
drawback reduces the efficiency of I/O access, while the latter
increases the memory needed to load a model and limits
on-demand loading and the use of proxies between files. In addition, XMI does
not provide advanced features such as model versioning or
concurrent modifications.</p>
      <p>
        The CDO [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] model repository was built to solve those
problems. It was designed as a framework to manage large
models in a collaborative environment with a small memory
footprint. CDO relies on a client-server architecture supporting
transactional access and notifications. CDO servers can be
built on top of several persistence solutions, but in practice
only relational databases are used to store CDO objects.
      </p>
    </sec>
    <sec id="sec-5">
      <title>2.2 Graph Databases</title>
      <p>Graph databases are one of the NoSQL data models that
have emerged to overcome the limitations of relational databases
with respect to scale and distribution. Many NoSQL databases
relax ACID properties but, in return, are able to
handle large-scale data efficiently in a distributed
environment.</p>
      <p>Graph databases are based on nodes, edges, and
properties. This data representation fits EMF models exactly,
since they are intrinsically graphs (each object can be
seen as a node and each reference as an edge). Thus, graph databases
can store EMF models without a complex serialization
process.</p>
    </sec>
    <sec>
      <title>3. NEO4EMF</title>
      <p>
        Neo4EMF is a persistence layer built on top of the EMF
framework that aims at handling large models in a
scalable way. It provides an EMF-compatible API and a
graph-database persistence backend based on Neo4j [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
      <p>
        Neo4EMF is open source and distributed under the terms
of the (A)GPLv3 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        In previous work [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], we introduced the basic concepts of
Neo4EMF: model change tracking and on-demand loading.
Model change tracking is based on a global changelog that
stores the modifications made to a model during an
execution (from creation to save). Modifications are tracked
using the EMF notification facilities: the changelog acts
as a listener for all the objects and creates its entries from
the notifications it receives. Neo4EMF uses an on-demand
loading mechanism to load object fields only when they are
accessed. Technically, each Neo4EMF object is instantiated
as an empty container. When one of its fields (EReferences
and EAttributes) is accessed, the associated content is
loaded. This mechanism has two advantages: (i) the
entire model does not have to be loaded at once and (ii)
unused elements are never loaded.
      </p>
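      <p>As an illustration only, on-demand loading can be sketched as follows. This is a minimal sketch under stated assumptions, not the Neo4EMF implementation: the names FieldBackend, MapFieldBackend, and LazyObject are hypothetical, and an in-memory map stands in for the Neo4j backend.</p>
      <preformat><![CDATA[
```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical backend: returns the stored value of one field of one object. */
interface FieldBackend {
    Object fetch(String objectId, String feature);
}

/** In-memory stand-in for the database (assumption); it also counts accesses. */
class MapFieldBackend implements FieldBackend {
    private final Map<String, Object> store = new HashMap<>();
    int fetchCount = 0;

    void put(String objectId, String feature, Object value) {
        store.put(objectId + "#" + feature, value);
    }

    @Override
    public Object fetch(String objectId, String feature) {
        fetchCount++;
        return store.get(objectId + "#" + feature);
    }
}

/** An object created as an empty container; each field is loaded on first access. */
class LazyObject {
    private final String id;
    private final FieldBackend backend;
    private final Map<String, Object> loadedFields = new HashMap<>();

    LazyObject(String id, FieldBackend backend) {
        this.id = id;
        this.backend = backend;
    }

    Object get(String feature) {
        // Query the backend only if this field has not been loaded yet.
        if (!loadedFields.containsKey(feature)) {
            loadedFields.put(feature, backend.fetch(id, feature));
        }
        return loadedFields.get(feature);
    }

    int loadedFieldCount() {
        return loadedFields.size();
    }
}
```
]]></preformat>
      <p>In this sketch, an object whose fields are never read triggers no backend access, and repeated reads of the same field reach the backend only once.</p>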
      <p>Neo4EMF does not use the EStore mechanism.
EStore allows the EObject data storage to be changed by
providing a stateless object that translates model modifications
and accesses into backend calls. Every generated
accessor and modifier delegates to the reflective API. As
a consequence, EObjects have to go through the store
each time a field is requested, which triggers several database
queries. In contrast, Neo4EMF is based on regular
EObjects (with in-memory fields) that are synchronized
with a database backend.</p>
      <p>In this paper, we focus on the memory footprint of Neo4EMF. We
introduce a strategy to unload parts of a processed
model and save memory during query execution. In the
previous implementation, the on-demand loading mechanism
allowed us to load only the parts of the model that are needed,
but there was no way to remove unneeded objects from
memory, especially when they had been changed but not yet
saved.</p>
      <p>A reliable unload strategy needs to address two main issues:</p>
      <p>Accessibility: The contents of unloaded objects (attributes
and referenced objects) have to remain accessible through
standard EMF accessors.</p>
      <p>Transparency: The management of the object life
cycle has to be independent from users, but
customizable to fit specific needs, e.g., the size of the Java virtual
machine or requirements on execution time.</p>
      <p>Our strategy addresses these issues with a dirty-saving
mechanism, which provides temporary and transparent model
persistence. The object life cycle has also been modified to
include unloading of persisted elements.</p>
      <p>In the next sections, we provide an overview of the changelog
used to record the modifications of the processed model.
Then, we present dirty saving, based on the basic Neo4EMF
save mechanism, and we describe the Neo4EMF object life
cycle. Finally, we describe the modifications made to the
on-demand loading feature to handle this new strategy.</p>
    </sec>
    <sec id="sec-6">
      <title>3.1 Neo4EMF Changelog</title>
      <p>Neo4EMF needs a mechanism to ensure synchronization
between the in-memory model and its backend representation
while avoiding systematic, unnecessary calls to the database.
Although EMF already provides a modification tracking
mechanism, the ChangeRecorder class, we decided to
develop an alternative solution that minimizes memory
consumption.</p>
      <p>Neo4EMF tracks model modifications in a changelog, a
sequence of entries of five types:</p>
      <p>Object creation: A new object has been created and
attached to a Neo4EMF resource.</p>
      <p>Object deletion: An object has been deleted or removed
from a Neo4EMF resource.</p>
      <p>Attribute modification: Attribute setting and unsetting.</p>
      <p>Unidirectional and bidirectional reference modifications: We
distinguish these two types for performance reasons (they are not serialized the
same way during the saving process).</p>
      <p>Figure 1 summarizes our changelog model. All changelog
entries are subclasses of Entry, which defines some shared
properties: the object concerned by the modification (for
instance, the object containing a modified attribute or
reference, or the new object in the case of a CreateObject entry)
and a basic serialization method.</p>
      <p>Attribute and reference modification entries (SetAttribute,
AddLink, RemoveLink and their subclasses) have three
additional fields to track fine-grained modifications: the
updated feature (attribute or reference identifier), which
corresponds to the modified field of the concerned object, and the
new and old values of the feature (if available).</p>
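      <p>The entry hierarchy described above can be sketched as follows. This is a simplified illustration rather than the actual Neo4EMF classes: strings stand in for EObjects, the serialized form of each entry is invented for the example, and AddLinkEntry keeps only the linked object instead of separate old and new values.</p>
      <preformat><![CDATA[
```java
import java.util.ArrayList;
import java.util.List;

/** Shared properties: the concerned object and a basic serialization method. */
abstract class LogEntry {
    final String target; // stands in for the concerned EObject (assumption)

    LogEntry(String target) {
        this.target = target;
    }

    abstract String serialize();
}

class CreateObjectEntry extends LogEntry {
    CreateObjectEntry(String target) {
        super(target);
    }

    @Override
    String serialize() {
        return "CREATE " + target;
    }
}

/** Fine-grained entry: updated feature, old value, and new value. */
class SetAttributeEntry extends LogEntry {
    final String feature;
    final Object oldValue;
    final Object newValue;

    SetAttributeEntry(String target, String feature, Object oldValue, Object newValue) {
        super(target);
        this.feature = feature;
        this.oldValue = oldValue;
        this.newValue = newValue;
    }

    @Override
    String serialize() {
        return "SET " + target + "." + feature + " = " + newValue;
    }
}

class AddLinkEntry extends LogEntry {
    final String feature;
    final String linked;

    AddLinkEntry(String target, String feature, String linked) {
        super(target);
        this.feature = feature;
        this.linked = linked;
    }

    @Override
    String serialize() {
        return "LINK " + target + " -[" + feature + "]-> " + linked;
    }
}

/** Changelog: entries are serialized in the order they were recorded. */
class EntryLog {
    private final List<LogEntry> entries = new ArrayList<>();

    void record(LogEntry entry) {
        entries.add(entry);
    }

    int size() {
        return entries.size();
    }

    List<String> serializeAll() {
        List<String> operations = new ArrayList<>();
        for (LogEntry entry : entries) {
            operations.add(entry.serialize());
        }
        return operations;
    }
}
```
]]></preformat>
      <p>Because each entry carries the feature identifier and values itself, serialization in this sketch never needs to read the concerned object again.</p>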
      <p>This decomposition provides direct access to the
information required during the serialization process, without
accessing the concerned objects. The fine-grained entry
management also decreases memory consumption. For instance,
a modification of a bidirectional reference corresponds to a
single changelog entry, whereas it previously required two basic entries.
Serialization of those entries is also more efficient since
it reduces the number of database accesses.</p>
      <p>In the previous version of Neo4EMF, we used the EMF
notification framework to create changelog entries. This
implementation had a major drawback: notifications were
handled in a dedicated thread, and we could not ensure that
all the notifications had been sent to the changelog before its
serialization. This behavior could create an inconsistency
between the in-memory model and the saved one. This is
another reason why we do not use the EMF ChangeRecorder
facilities, which rely on notifications.</p>
      <p>In this new version, changelog entries are created directly
in the body of the generated methods. This solution
removes synchronization issues and is also more efficient,
because entries are created directly, and all the information
needed to construct them is available in the method body
(current object, feature identifier, new and old values). We
also no longer have to deal with the generic notification API,
which required many casts and complex processing
to retrieve this information. Synchronizing the changelog
brings another important benefit: the causality between
model modifications and entry order is ensured, and there
is no need to reorder the entry stack before its serialization.</p>
      <p>Finally, we modify the changelog life cycle. In the previous
version, the changelog was a global singleton object
containing the record of a full execution and mixing modifications
of multiple resources. This solution is not optimal because
saving is done per resource in EMF, and to save a single
resource the entire modification stack had to be processed
to retrieve the corresponding entries. We chose to create a
dedicated changelog in each Neo4EMF resource that
handles modifications only for the objects contained in that
resource. This change reduces the complexity
of save processing: the resource changelog is simply
iterated and its entries are serialized into database calls.
The synchronized nature of the changelog allows us to
process the entries in the order they were added, which was not
possible in the previous version.</p>
      <p>Furthermore, associating a changelog with a resource
ensures that, when the resource is deleted, all the related
entries are also deleted. In the previous version, entries could
not be deleted from the global changelog, and were kept in
memory during the execution.</p>
    </sec>
    <sec id="sec-7">
      <title>3.2 Dirty Saving</title>
      <p>
        Neo4EMF relies on a mapping between EMF entities and
Neo4j concepts to save its modifications. Figure 2 shows
an excerpt of the Java metamodel used in the MoDisco [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]
project. This metamodel describes Java applications in terms
of Packages, ClassDeclarations, BodyDeclarations, and
Comments. A Package is a named container that gathers
a set of ClassDeclarations through its owned elements
composition. A ClassDeclaration is composed of a name,
a set of Comments, and a set of BodyDeclarations.
Figure 3 shows a simple instance of this metamodel: a
Package (package1) containing one ClassDeclaration (class1).
This ClassDeclaration contains two Comments (comment1
and comment2) and a single BodyDeclaration (body1).
      </p>
      <p>Figures 2, 3, and 4 show that:</p>
      <p>Model elements are represented as nodes. Nodes with
identifiers p1, cl1, b1, and com1 are examples
corresponding to package1, class1, body1, and comment1 in Figure 3. The
root node represents the entry point of the model (the
resource directly or indirectly containing all the other
elements) and is not associated with a model object.</p>
      <p>Element attributes are represented as node properties.
Node properties are ⟨name, value⟩ pairs, where name
is the feature identifier and value is the value of the
feature. Node properties can be observed for p1, cl1, and
b1.</p>
      <p>Metamodel elements are also represented as nodes and
are indexed to facilitate their access. Metamodel nodes
have two properties: the metaclass name and the
unique identifier of the metamodel. P, Cl, B, and Com are
examples of metamodel element nodes; they correspond
to PackageDeclaration, ClassDeclaration,
BodyDeclaration, and Comment, respectively, in Figure 2.</p>
      <p>InstanceOf relationships are outgoing relationships
between the element nodes and the nodes representing
metaclasses. They represent the conformance of an
object instance to its class definition.</p>
      <p>References between objects are represented as
relationships. To avoid naming conflicts, relationships are named
using the following convention:
class name, then reference name.</p>
      <p>When a save is requested, changelog entries are processed to
update the database backend. Each entry is serialized into a
database operation. The CreateObject entry corresponds
to the creation of a new node and its meta-information
(instanceof to its metaclass, isRoot if the object is
directly contained in the resource). All the fields of the object
are also serialized and directly saved in the database. A
SetAttribute entry corresponds to an update of the related
node's property with the corresponding name. AddLink,
RemoveLink, and their subclasses respectively record the
creation and removal of a relationship, storing the
containing class and feature name.</p>
      <p>We decided to serialize a created object
and all its references and attributes at the same time. New objects need to
be entirely persisted, and there is no reason to record their
modifications before their first serialization (the final state
of the object is the one that needs to be persisted). This full
serialization behavior has the advantage of generating only
a single entry for a new object, regardless of the
number of its modified fields.</p>
      <p>[Figure residue: relationship labels CLASS__DECLARATION_BODY_DECLARATIONS and CLASS__DECLARATION_COMMENTS.]</p>
      <p>This approach works well for small models, but has issues
when a large modification set needs to be persisted: the
changelog grows indefinitely until the user decides to save
it. This is typically the case in reverse engineering, where
the extracted objects are first all created in memory and
saved only afterwards.</p>
      <p>To address this problem, we introduce dirty saving, a
periodic save action not requested by the user. The period
is determined by the changelog size, which is configurable through
the Neo4EMF resource. Since these save operations are not
requested by the user, they have to ensure two properties:</p>
      <p>Reversibility: if the modifications are canceled or if
the user does not want to save a session, the database
should roll back to an acceptable version. This version
is either (i) the previous regularly saved database if an
older version exists or (ii) an empty database.</p>
      <p>Persistability: if a regular save is requested by the
user, the temporary objects in the database have to
be permanently persisted. They can then constitute a
new acceptable version of the database if a rollback is
needed.</p>
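      <p>The interplay between the size-triggered dirty save and the two properties above can be sketched as follows. This is a minimal sketch under stated assumptions: DirtySavingSketch is a hypothetical class, entries are plain strings, and two in-memory lists stand in for the temporary and regular contents of the database.</p>
      <preformat><![CDATA[
```java
import java.util.ArrayList;
import java.util.List;

class DirtySavingSketch {
    private final int maxChangelogSize;                         // configurable period
    private final List<String> changelog = new ArrayList<>();   // in-memory entries
    private final List<String> dirtyStore = new ArrayList<>();  // temporarily saved
    private final List<String> savedStore = new ArrayList<>();  // regularly saved
    int dirtySaveCount = 0;

    DirtySavingSketch(int maxChangelogSize) {
        if (maxChangelogSize < 1) {
            throw new IllegalArgumentException("changelog size must be positive");
        }
        this.maxChangelogSize = maxChangelogSize;
    }

    /** Records one modification; a full changelog triggers a dirty save. */
    void record(String entry) {
        changelog.add(entry);
        if (changelog.size() >= maxChangelogSize) {
            dirtySave();
        }
    }

    /** Save not requested by the user: entries move to temporary storage. */
    private void dirtySave() {
        dirtyStore.addAll(changelog);
        changelog.clear();
        dirtySaveCount++;
    }

    /** Persistability: a regular save makes temporary content permanent. */
    void save() {
        savedStore.addAll(dirtyStore);
        savedStore.addAll(changelog);
        dirtyStore.clear();
        changelog.clear();
    }

    /** Reversibility: a rollback drops everything since the last regular save. */
    void rollback() {
        dirtyStore.clear();
        changelog.clear();
    }

    int pendingEntries() {
        return changelog.size();
    }

    List<String> saved() {
        return new ArrayList<>(savedStore);
    }
}
```
]]></preformat>
      <p>With a changelog size of 2, recording three entries triggers one dirty save and leaves one entry pending; a rollback at that point leaves the regularly saved content untouched.</p>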
      <p>We introduce a new mapping for changelog entries for the
purpose of temporary dirty saving. This mapping is based
on the same entries as the regular mapping, but the
associated Neo4j concepts allow the system to easily distinguish dirty
objects from regular ones. In addition, we create two indexes,
tmp_relationships and tmp_nodes, which respectively
contain the dirty relationships and nodes (i.e., those created in a dirty
saving session). Figure 5 summarizes the mapping between
changelog entries and Neo4j concepts:</p>
      <p>CreateObject: creation of a new node (as in the
regular saving process) and addition to the tmp_nodes
index.</p>
      <p>SetAttribute: creation of a dedicated node
containing the dirty attributes. The idea is to keep a stable
version (i.e., the previous regularly saved version) so that the change can be
easily reverted. A SetAttribute relationship is
created to link the base object and its attribute node.</p>
      <p>AddLink: creation of a generic AddLink
relationship, containing the reference identifier as a property.
This special relationship format is needed to easily
process dirty relationships and retrieve their
corresponding image if a regular save operation is requested.</p>
      <p>RemoveLink: creation of a generic RemoveLink
relationship, containing the reference identifier as a
property. AddLink and RemoveLink relationships with
the same reference identifier and target object are
mutually exclusive, to limit the number of temporary
objects in the database.</p>
      <p>DeleteObject: creation of a special Delete
relationship looping on the related node. The base version of
the node is kept alive in case a rollback is needed.</p>
      <p>The objective of this mapping is to preserve all the
information stored by the last regular save, so that a
rollback is easy to handle. That is why object deletion is done using a
relationship: if the modifications are aborted, it is simpler to
remove the relationship than to create a new instance of the
node from backup information. We do not use a property to
tag deleted objects for performance reasons (access to node
properties is slower than edge navigation).</p>
      <p>To turn dirty objects in the database into
regularly saved ones, a serialization process is invoked. Like changelog
entries, Neo4j elements contain all the information needed
to create their regular equivalents: new objects are simply
removed from the tmp_nodes index, AddLink relationships
are turned into their regular version using their properties,
and RemoveLink relationships correspond to the deletion of their
existing regular version.</p>
      <p>For example, if we update the model given in Figure 3 by
removing com1 and creating a new BodyDeclaration body2,
and then call a dirty save, the database is updated as shown in
Figure 6. Note that a Delete relationship has been created
because the removed Comment is not contained in the
resource anymore. Red relationships and nodes are indexed
in the tmp_relationships and tmp_nodes indexes, respectively.
This example shows that our mapping is built on top of the
existing one: no modification is made to the
previous version, represented with black nodes. This simplifies
the rollback process, which consists of deleting all the
temporary Neo4j objects.</p>
    </sec>
    <sec id="sec-8">
      <title>3.3 Object Life Cycle</title>
      <p>We modify the Neo4EMF object life cycle to enable
unloading. When a dirty save is invoked, all the modifications
contained in the changelog are committed to the database.
Because of this, persisted objects can be safely
released from memory and reloaded using on-demand
loading if needed.</p>
      <p>Figure 7 shows the different life cycle states of a Neo4EMF
object. When a Neo4EMF object is created, it is New: it
has not been persisted into the database and cannot be
released. When a save is requested or a dirty save is invoked,
the new object is persisted into the database and tagged
as Clear: all the known modifications related to the object
have been saved and it can be fetched from the database
without information loss. In this state, the object can be removed
from memory without consistency issues. When a modification
is made to the object (setting an attribute or updating
a reference), it is tagged as Modified.</p>
      <p>Modified objects cannot be released, because their
database-mapped nodes do not contain the modified information. When
a save is processed, the Modified objects revert to the Clear
state and can be released again. Objects being loaded also have
a dedicated state that prevents garbage collection of an object
while it is loading.</p>
      <p>[Figure 1: changelog model. Elements shown: ChangeLog, Entry, EObject, NewObject, DeleteObject, SetAttribute, AddLink, RemoveLink; multiplicity 1..*.]</p>
      <p>[Figure 5: mapping between changelog entries and Neo4j concepts. Elements shown: Neo4j::Node and Neo4j::RelationshipType with name "AddLink" (property relName), "RemoveLink" (property relName), "SetAttribute", and "Delete".]</p>
      <p>To allow garbage collection of Neo4EMF objects, we use
Java Soft and Weak references to store the fields of an object. Weak
and Soft referenced objects are eligible for garbage collection
as soon as there is no strong reference chain to them. The
difference between the two kinds of references is the time
they can remain in memory. Weak references are collected
as soon as possible by the garbage collector, whereas Soft
references can be retained in memory as long as the garbage
collector does not need to free them (i.e., as long as there
is enough available memory). This behavior is
useful for cache implementations and for optimizing
execution speed when a large amount of memory is available. The reference
type (Weak or Soft) can be set through the Neo4EMF resource
parameters.</p>
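      <p>The use of Soft and Weak references for object fields can be illustrated with the standard java.lang.ref classes. The sketch below is an assumption about how such a field holder could look: ReferenceKind and ReleasableField are hypothetical names, not Neo4EMF classes.</p>
      <preformat><![CDATA[
```java
import java.lang.ref.Reference;
import java.lang.ref.SoftReference;
import java.lang.ref.WeakReference;

/** Hypothetical configuration value, e.g. read from a resource parameter. */
enum ReferenceKind { WEAK, SOFT }

/** Holds a field value without keeping it strongly reachable. */
class ReleasableField<T> {
    private final Reference<T> ref;

    ReleasableField(T value, ReferenceKind kind) {
        if (kind == ReferenceKind.SOFT) {
            // Soft: may be retained until the collector needs the memory.
            this.ref = new SoftReference<T>(value);
        } else {
            // Weak: may be cleared as soon as no strong reference remains.
            this.ref = new WeakReference<T>(value);
        }
    }

    /** Returns the value, or null if the garbage collector has reclaimed it. */
    T getOrNull() {
        return ref.get();
    }

    boolean isSoft() {
        return ref instanceof SoftReference;
    }
}
```
]]></preformat>
      <p>A caller that receives null from getOrNull would reload the value, for example through on-demand loading.</p>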
      <p>As described in Section 3.1, changelog entries contain all
the information related to the serialization of the concerned
object. This information constitutes the strong reference
chain to the fields of the related object. When a save is done,
entries are processed and deleted, breaking the strong reference
chain and making objects eligible for garbage collection.
Neo4j objects are not affected by this new life cycle. The
database manages the life cycle of its objects through a policy
defined when the resource is created (memory or performance
preferences).</p>
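      <p>The life cycle states discussed in this section (New, Clear, Modified, and Loading) can be summarized as a small state machine. The sketch is illustrative: ObjectState is a hypothetical name, and two transitions are assumptions that the text does not state, namely that a New object stays New when modified and that a save leaves a Loading object unchanged.</p>
      <preformat><![CDATA[
```java
enum ObjectState {
    NEW, CLEAR, MODIFIED, LOADING;

    /** Only Clear objects can be released without information loss. */
    boolean isReleasable() {
        return this == CLEAR;
    }

    /** A regular or dirty save persists New and Modified objects. */
    ObjectState onSave() {
        // Assumption: a save does not affect an object that is loading.
        return (this == NEW || this == MODIFIED) ? CLEAR : this;
    }

    /** Setting an attribute or updating a reference. */
    ObjectState onModify() {
        // Assumption: a New object stays New until its first save.
        return (this == CLEAR) ? MODIFIED : this;
    }
}
```
]]></preformat>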
    </sec>
    <sec id="sec-9">
      <title>3.4 Extended On-Demand Loading</title>
      <p>To handle the new architecture of our layer, we have to
extend the on-demand loading feature to support temporarily
persisted objects. On-demand loading uses two parameters:
(i) the object that holds the feature to load and (ii) the
identifier of the feature to load. This behavior implies that
a Neo4EMF object is always loaded from another Neo4EMF
object.</p>
      <p>Figure 6 shows our Java metamodel instance state after a
dirty save. The database content is a mix between regularly
saved objects (in black) and dirty-saved ones (in red).
Loading referenced Comments instances from
ClassDeclaration cl1 is done in three steps to ensure the last dirty-saved
operations have been considered.</p>
      <p>First, the class declaration comments relationships are
processed and their end nodes are collected. Second, the AddLink
relationships containing the corresponding rel property are
processed and their end nodes are added to the previous
ones. This operation retrieves all the associated nodes for
the given feature, both regular and dirty. Third,
RemoveLink relationships are processed the same way and
their end nodes are removed from the loaded node set.</p>
      <p>Attribute fetching behaves slightly differently: if a node
representing an object has a relationship to a dedicated attribute
node, then the data contained in this node is returned
instead of the base node property.</p>
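      <p>The three loading steps and the attribute rule amount to simple set operations, as the following sketch shows. It is an illustration under stated assumptions: DirtyAwareLoader is a hypothetical name, and node identifiers are given as plain lists rather than read from Neo4j relationships.</p>
      <preformat><![CDATA[
```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class DirtyAwareLoader {

    /** Resolves the nodes referenced by one feature after a dirty save. */
    static Set<String> resolveLinks(List<String> regularEnds,
                                    List<String> addLinkEnds,
                                    List<String> removeLinkEnds) {
        // Step 1: end nodes of the regular relationships.
        Set<String> result = new LinkedHashSet<>(regularEnds);
        // Step 2: add the end nodes of matching AddLink relationships.
        result.addAll(addLinkEnds);
        // Step 3: remove the end nodes of matching RemoveLink relationships.
        result.removeAll(removeLinkEnds);
        return result;
    }

    /** Returns the dirty attribute value when a dedicated attribute node exists. */
    static Object resolveAttribute(Object baseValue, boolean hasAttributeNode, Object dirtyValue) {
        return hasAttributeNode ? dirtyValue : baseValue;
    }
}
```
]]></preformat>
      <p>For instance, if cl1 regularly references com1 and com2 and a dirty save recorded the removal of com1, only com2 is loaded.</p>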
      <p>To improve the performance of our layer, we create a cache
that maps Neo4j identifiers to their associated objects. When
on-demand loading is performed, the cache is checked first,
avoiding the cost of a database access. This cache is also
used to retrieve released objects.</p>
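      <p>A cache of this kind can be sketched as a map from node identifiers to weakly referenced objects. This is a minimal sketch, not the Neo4EMF cache: NodeObjectCache is a hypothetical name, the use of weak references inside the cache is an assumption, and the loader function stands in for a database access.</p>
      <preformat><![CDATA[
```java
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongFunction;

class NodeObjectCache<T> {
    private final Map<Long, WeakReference<T>> cache = new HashMap<>();
    int databaseLoads = 0;

    /** Returns the cached object, or loads and caches it on a miss. */
    T get(long nodeId, LongFunction<T> loader) {
        WeakReference<T> ref = cache.get(nodeId);
        T cached = (ref == null) ? null : ref.get();
        if (cached != null) {
            return cached; // cache hit: no database access
        }
        // Miss, or the object was garbage collected: reload it.
        T loaded = loader.apply(nodeId);
        databaseLoads++;
        cache.put(nodeId, new WeakReference<T>(loaded));
        return loaded;
    }
}
```
]]></preformat>
      <p>Because the cache holds its values weakly in this sketch, it does not prevent released objects from being collected; it only avoids a reload while they are still in memory.</p>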
    </sec>
    <sec id="sec-10">
      <title>4. EVALUATION</title>
      <p>
        In this section, we evaluate how the memory footprint and
the access time of Neo4EMF scale in different large-model
scenarios, and we compare it against CDO and XMI. These
experiments are performed over two EMF models extracted
with the MoDisco Java Discoverer [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. Both models are
extracted from Eclipse plug-ins: the first one is an internal tool
and the second one is the Eclipse JDT plug-in. The
resulting XMI files are 20 MB and 420 MB, containing respectively
around 80 000 and 1 700 000 elements.
      </p>
    </sec>
    <sec id="sec-11">
      <title>4.1 Execution Environment</title>
      <p>Experiments are executed on a computer running Windows
7 Professional (64-bit). The relevant hardware
components are an Intel Core i5-3350P processor (3.5 GHz), 8 GB
of DDR3 SDRAM (1600 MHz), and a Seagate Barracuda
7200.14 hard disk (6 Gb/s). Experiments are executed on
Eclipse 4.3 running Java SE Runtime Environment 1.8.</p>
      <p>To compare the three persistence solutions, we generate
three different EMF models from the MoDisco Java
metamodel: (i) the standard EMF model, (ii) the CDO one, and
(iii) the Neo4EMF one. We import both models from XMI
into CDO and Neo4EMF and verify that they contain the same
data after the import.</p>
      <p>Neo4EMF uses an embedded Neo4j database to store its
objects. To provide a meaningful comparison in terms of
memory consumption, we chose to use an embedded CDO
server.</p>
      <p>Experiment 1: Object creation. In this first
experiment, we execute an infinite loop of object creation and
simply count how many objects have been created before an
OutOfMemoryError is thrown. We chose a
simple tree structure of three classes to instantiate from the
MoDisco Java metamodel: a parent ClassFile containing
1000 BlockComment and ImportDeclaration instances. The
resulting model is a set of independent element trees. For this
experiment, we chose a 1 GB Java virtual machine and an
arbitrarily fixed changelog size of 100 000 entries. Table 1
summarizes the results.</p>
      <p>Table 1: number of elements created (#Created Elements) for each persistence layer.</p>
      <p>Note that the number given for Neo4EMF is an
approximation: we stopped the execution before any OutOfMemory
error occurred. The average memory used to create elements was
around 500 MB and did not seem to grow. This
performance is due to the dirty-saving mechanism: created
objects generate entries in the changelog. When the changelog
is full, changes are saved temporarily in the database, freeing
the changelog for subsequent object creations.</p>
      <p>Experiment 2: Model traversal. In this experiment, we
load a model and execute a traversal query that starts from
the root of the model, traverses all the containment tree and
modi es the name attribute of all NamedElements. All
the modi cations are saved at the end of the execution.
During the traversal, we measure the execution time for covering
the entire model and the average memory used to perform
the query. In addition, we measure the memory needed to
save the modi cations at the end of the execution.
Figures 8 and 9 summarize memory results. As expected, the
Neo4EMF traversal footprint is higher than the XMI one
because we include the Neo4j embedded database and runtime
in our measures. Unloading brings a real interest when
comparing the results with CDO: when removing unused (i. e.,
unreferenced) objects we save space and process the request
in a reduced amount of memory. For this experiment we
use a 4 GB Java virtual machine, with the ConcMarkSweepGC
garbage collector, recommended when using Neo4j.</p>
      <p>Experiment 3: Time performance. This experiment is
similar to the previous one, but we focus on time
performance. We measure the time needed to perform the traversal
and the save. Figures 10 and 11 summarize the results. To
provide a fair comparison between full and on-demand
loading strategies, we also include the model loading time with the
traversal queries.
Neo4EMF save performance can be explained by
dirty-saving: during the traversal, entries are generated to track
the name modifications. These entries are then saved in the
database when the changelog is full, reducing the final save
cost. This behavior also explains part of the traversal time
overhead when compared to CDO: the Neo4EMF traversal
implies database write accesses for dirty-saving whereas CDO does
not, and the related I/O accesses considerably impact performance.</p>
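      <p>The traversal query used in Experiments 2 and 3 can be sketched as follows. This is only an illustration under stated assumptions: ModelNode is a hypothetical stand-in for an EMF element with a name and containment children, and the heap reading based on Runtime is a coarse estimate, not the measurement procedure used for the figures.</p>
      <preformat><![CDATA[
```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Minimal stand-in for a named model element with containment children (hypothetical). */
final class ModelNode {
    String name;
    final List<ModelNode> children = new ArrayList<>();

    ModelNode(String name) { this.name = name; }

    ModelNode add(ModelNode child) { children.add(child); return this; }
}

final class TraversalBenchmark {
    /** Visits the whole containment tree from the root, renames every element, returns the count. */
    static int renameAll(ModelNode root, String suffix) {
        int visited = 0;
        Deque<ModelNode> stack = new ArrayDeque<>(); // explicit stack: deep trees cannot overflow the call stack
        stack.push(root);
        while (!stack.isEmpty()) {
            ModelNode node = stack.pop();
            node.name = node.name + suffix;
            visited++;
            for (ModelNode child : node.children) {
                stack.push(child);
            }
        }
        return visited;
    }

    /** Heap currently in use, in bytes; coarse, because garbage collection timing is not controlled. */
    static long usedHeap() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        ModelNode root = new ModelNode("ClassFile");
        for (int i = 0; i < 1000; i++) {
            root.add(new ModelNode("BlockComment" + i));
        }
        long heapBefore = usedHeap();
        long start = System.nanoTime();
        int visited = renameAll(root, "_renamed");
        long elapsedMicros = (System.nanoTime() - start) / 1_000;
        long heapDelta = usedHeap() - heapBefore;
        System.out.println(visited + " elements renamed in " + elapsedMicros
                + " us, heap delta " + heapDelta + " bytes");
    }
}
```
]]></preformat>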
    </sec>
    <sec id="sec-12">
      <title>4.2 Discussion</title>
      <p>The results of these experiments show that dirty-saving
coupled with on-demand loading significantly decreases the
memory needed to execute a query. As expected, this memory
footprint improvement worsens the time performance of our
tool, in particular because of dirty-saving, which generates
several database calls. That is why we make dirty-saving
configurable through the Neo4EMF resource. The
experiments also show that Neo4EMF is able to handle large
queries and modifications in a limited amount of memory,
compared to existing solutions.
We also ran our benchmarks on different operating
systems (Ubuntu 12.04 and 13.10) and found that CDO and
Neo4EMF time performance seems to be linked to the file
partition format (especially for I/O accesses): Neo4j
performs better on these operating systems (by a factor
of 1.5) and CDO is slower (by approximately the
same factor). More investigation is needed to optimize our
tool in different contexts.</p>
      <p>Our experiments show that Neo4EMF is an interesting
alternative to CDO for handling large models in
memory-constrained environments. On-demand loading and
transparent unloading offer a small memory footprint (smaller than
CDO in our experiments), but our solution does not provide
advanced features such as the collaborative editing and versioning
provided by CDO.</p>
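      <p>On-demand loading combined with transparent unloading can be illustrated with a small cache. The sketch assumes weak references as the unloading mechanism and a loader function that stands in for a database lookup; both the class name OnDemandCache and this choice of mechanism are assumptions made for illustration, not a description of the Neo4EMF internals.</p>
      <preformat><![CDATA[
```java
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongFunction;

/** Cache that loads objects on first access and lets the garbage collector reclaim unreferenced ones. */
final class OnDemandCache<T> {
    private final Map<Long, WeakReference<T>> loaded = new HashMap<>();
    private final LongFunction<T> loader; // stands in for a database lookup by identifier
    private int loadCount = 0;

    OnDemandCache(LongFunction<T> loader) {
        this.loader = loader;
    }

    /** Returns the live instance if there is one; otherwise loads it (again) from the backend. */
    T get(long id) {
        WeakReference<T> ref = loaded.get(id);
        T value = (ref == null) ? null : ref.get();
        if (value == null) { // never loaded, or already collected
            value = loader.apply(id);
            loaded.put(id, new WeakReference<>(value));
            loadCount++;
        }
        return value;
    }

    // Note: cleared references stay in the map until overwritten; a ReferenceQueue
    // could be used to purge them, which is omitted here for brevity.

    int loadCount() { return loadCount; }
}

final class OnDemandDemo {
    public static void main(String[] args) {
        OnDemandCache<StringBuilder> cache =
                new OnDemandCache<>(id -> new StringBuilder("element" + id));
        StringBuilder first = cache.get(42);  // loaded from the backend
        StringBuilder again = cache.get(42);  // still strongly referenced, so served from the cache
        System.out.println("same instance: " + (first == again) + ", loads: " + cache.loadCount());
    }
}
```
]]></preformat>
      <p>As long as client code holds a strong reference to an element, the same instance is returned; once no such reference remains, the element becomes eligible for collection and is reloaded on the next access.</p>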
      <p>The unload strategy is transparent for the user, but may be
intrusive in some cases, for instance if hard-drive
space is limited or time performance is critical.
This is why we make dirty-saving and the
changelog size configurable through the Neo4EMF resource.</p>
    </sec>
    <sec id="sec-13">
      <title>5. RELATED WORK</title>
      <p>
        Models obtained by reverse engineering with EMF-based
tools such as MoDisco [
        <xref ref-type="bibr" rid="ref11 ref17 ref5">17, 5, 11</xref>
        ] can be composed of
millions of elements. Existing solutions for handling such
models have shown clear limitations in terms of memory
consumption and processing.
      </p>
      <p>
        CDO is the de facto standard to handle large models using
a server and a relational database. However, some
experiments have shown that CDO does not scale well to very
large models [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Pagan et al. [
        <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
        ] propose to use NoSQL
databases to store models, especially because this kind of
database should better fit the interconnected nature of
EMF models.
      </p>
      <p>
        Mongo EMF [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] is a NoSQL approach that stores EMF
models in MongoDB, a document-oriented database. However,
Mongo EMF storage is different from the standard EMF
persistence backend, and cannot be used as is to replace
another persistence solution in an existing system.
Modifications to the client software are needed to integrate it.
Morsa [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] is another persistence solution based on the
MongoDB database. Similarly to Neo4EMF, Morsa uses a
standard EMF mechanism to ensure persistence, but it uses a
client-server architecture, like CDO. Morsa has some
similarities with Neo4EMF, notably in its on-demand loading
mechanism, but does not use a graph database.
EMF Fragments [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] is another EMF persistence layer based
on a NoSQL database. The EMF Fragments approach is
different from other NoSQL persistence solutions: it relies on
the proxy mechanism provided by EMF. Models are
automatically partitioned and loading is performed by partition.
Loading on demand is only performed for cross-partition
references. Another difference with Neo4EMF is that EMF
Fragments needs to annotate the metamodels to provide the
partition set, whereas our approach does not require model
adaptation or tool modification.
      </p>
    </sec>
    <sec id="sec-14">
      <title>6. CONCLUSION AND FUTURE WORK</title>
      <p>In this paper, we presented a strategy to optimize the
memory footprint of Neo4EMF, a persistence layer designed to
handle large models through on-demand loading and
transparent unloading. Our experiments show that Neo4EMF is
an interesting alternative to CDO for accessing and
querying large models, especially when little memory is
available, with a tolerable performance loss. Neo4EMF does not
have collaborative model editing or model versioning
features, which biases our results: providing those features may
imply higher memory consumption.</p>
      <p>In future work, we plan to improve our layer by providing
partial collection loading, which allows subparts of large
collections to be loaded from the database. In our experiments, we
detected some memory consumption overhead in this
particular case: when an object contains a huge number of
referenced objects (through the same reference) and they are
all loaded at once.</p>
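      <p>One possible shape for partial collection loading is a list that fetches its content page by page. The following sketch is only an assumption about how such a feature could look: RangeLoader and PagedList are hypothetical names, the loader stands in for a range query on the database, and page eviction is left out.</p>
      <preformat><![CDATA[
```java
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical source of the targets of a large reference, fetched by index range. */
interface RangeLoader<T> {
    List<T> load(int fromIndex, int toIndex); // toIndex is exclusive
}

/** Read-only list that fetches its elements page by page instead of all at once. */
final class PagedList<T> extends AbstractList<T> {
    private final int size;
    private final int pageSize;
    private final RangeLoader<T> loader;
    private final Map<Integer, List<T>> pages = new HashMap<>(); // loaded pages, never evicted here

    PagedList(int size, int pageSize, RangeLoader<T> loader) {
        if (size < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("size must be >= 0 and pageSize > 0");
        }
        this.size = size;
        this.pageSize = pageSize;
        this.loader = loader;
    }

    @Override
    public T get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        int page = index / pageSize;
        List<T> content = pages.get(page);
        if (content == null) { // first access to this page: load only its range
            int from = page * pageSize;
            content = loader.load(from, Math.min(from + pageSize, size));
            pages.put(page, content);
        }
        return content.get(index % pageSize);
    }

    @Override
    public int size() { return size; }

    int loadedPages() { return pages.size(); }
}

final class PagedListDemo {
    public static void main(String[] args) {
        PagedList<String> comments = new PagedList<>(1_000_000, 1_000, (from, to) -> {
            List<String> page = new ArrayList<>(to - from);
            for (int i = from; i < to; i++) {
                page.add("BlockComment" + i);
            }
            return page;
        });
        System.out.println(comments.get(123_456) + ", pages loaded: " + comments.loadedPages());
    }
}
```
]]></preformat>
      <p>Accessing a single element of a large collection then materializes one page rather than the whole collection, which addresses the overhead case described above.</p>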
      <p>We then plan to study the inclusion of attribute and
reference meta-information directly in the database to avoid
unnecessary object loading: some EMF mechanisms, such as
isSet, may trigger on-demand loading of the associated attribute
just to perform a comparison. It could be
interesting to provide this information from the database without a
complete and costly object loading.</p>
      <p>Finally, we want to introduce loading strategies such as
prefetching or model partitioning (using optional metamodel
annotations or a definition of the model usage) to allow users
to customize the object life cycle.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>AtlanMod</surname>
          </string-name>
          .
          <source>Neo4EMF</source>
          ,
          <year>2014</year>
          . url: http://www.neo4emf.com/.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K.</given-names>
            <surname>Barmpis</surname>
          </string-name>
          and
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Kolovos</surname>
          </string-name>
          .
          <article-title>Comparative analysis of data persistence technologies for large-scale models</article-title>
          .
          <source>In Proceedings of the 2012 Extreme Modeling Workshop</source>
          , XM '
          <volume>12</volume>
          , pages
          <fpage>33</fpage>
          –
          <lpage>38</lpage>
          , New York, NY, USA,
          <year>2012</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Benelallam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , G. Sunye,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tisi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Launay</surname>
          </string-name>
          .
          <article-title>Neo4emf, a scalable persistence layer for emf models</article-title>
          .
          <source>July</source>
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Bettini</surname>
          </string-name>
          .
          <source>Implementing Domain-Specific Languages with Xtext and Xtend</source>
          .
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Bruneliere</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cabot</surname>
          </string-name>
          , G. Dupe, and
          <string-name>
            <given-names>F.</given-names>
            <surname>Madiot</surname>
          </string-name>
          .
          <article-title>Modisco: A model driven reverse engineering framework</article-title>
          .
          <source>Information and Software Technology</source>
          ,
          <volume>56</volume>
          (
          <issue>8</issue>
          ):
          <fpage>1012</fpage>
          –
          <lpage>1032</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Bruneliere</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cabot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Jouault</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Madiot</surname>
          </string-name>
          .
          <article-title>Modisco: A generic and extensible framework for model driven reverse engineering</article-title>
          .
          <source>In Proceedings of the IEEE/ACM International Conference on Automated Software Engineering, ASE '10</source>
          , pages
          <fpage>173</fpage>
          –
          <lpage>174</lpage>
          , New York, NY, USA,
          <year>2010</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Bryan</given-names>
            <surname>Hunt</surname>
          </string-name>
          .
          <source>MongoEMF</source>
          ,
          <year>2014</year>
          . url: https://github.com/BryanHunt/mongo-emf/wiki/.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Eclipse</given-names>
            <surname>Foundation</surname>
          </string-name>
          .
          <source>The CDO Model Repository (CDO)</source>
          ,
          <year>2014</year>
          . url: http://www.eclipse.org/cdo/.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9] INRIA and LINA
          .
          <source>ATLAS transformation language</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Markus</given-names>
            <surname>Scheidgen</surname>
          </string-name>
          .
          <source>EMF fragments</source>
          ,
          <year>2014</year>
          . url: https://github.com/markus1978/emf-fragments/wiki/.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Modeliosoft</given-names>
            <surname>Solutions</surname>
          </string-name>
          ,
          <year>2014</year>
          . url: http://www.modeliosoft.com/.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Musset</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Juliot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lacrampe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Piers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Brun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Goubet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lussaud</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Allilaire</surname>
          </string-name>
          . Acceleo user guide,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13] OMG.
          <source>MOF 2.0 QVT final adopted specification</source>
          (ptc/05-11-01),
          <year>April 2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Pagan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Cuadrado</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. G.</given-names>
            <surname>Molina</surname>
          </string-name>
          . Morsa:
          <article-title>A scalable approach for persisting and accessing large models</article-title>
          .
          <source>In Proceedings of the 14th International Conference on Model Driven Engineering Languages and Systems, MODELS'11</source>
          , pages
          <fpage>77</fpage>
          –
          <lpage>92</lpage>
          , Berlin, Heidelberg,
          <year>2011</year>
          . Springer-Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Pagan</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. G.</given-names>
            <surname>Molina</surname>
          </string-name>
          .
          <article-title>Querying large models efficiently</article-title>
          .
          <source>Information and Software Technology</source>
          ,
          <year>2014</year>
          . In press, accepted manuscript. url: http://dx.doi.org/10.1016/j.infsof.2014.01.005.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Partner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vukotic</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Watt</surname>
          </string-name>
          .
          <source>Neo4j in Action</source>
          . O'Reilly Media,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <article-title>The Eclipse Foundation</article-title>
          .
          <source>MoDisco Eclipse Project</source>
          ,
          <year>2014</year>
          . url: http://www.eclipse.org/MoDisco/.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>