=Paper=
{{Paper
|id=Vol-1406/paper4
|storemode=property
|title=An Efficient Computation Strategy for allInstances()
|pdfUrl=https://ceur-ws.org/Vol-1406/paper4.pdf
|volume=Vol-1406
|dblpUrl=https://dblp.org/rec/conf/staf/WeiK15
}}
==An Efficient Computation Strategy for allInstances()==
Ran Wei and Dimitrios S. Kolovos
Department of Computer Science, University of York, United Kingdom
{ran.wei, dimitris.kolovos}@york.ac.uk

Abstract. Contemporary model query and transformation engines typically provide built-in facilities for retrieving all instances of a particular type/kind regardless of their location in a model (i.e. OCL's allInstances()). When implemented in a naive manner, such facilities can be computationally expensive for large models. We contribute a novel approach for implementing allInstances()-like facilities for EMF models, which makes use of static analysis and metamodel introspection, and we report on the results of extensive benchmarking against alternative approaches.

===1 Introduction===
As models involved in MDE processes get larger and more complex [1, 2], model query and transformation languages are being stressed to their limits [3, 4]. One of the most computationally expensive operations that model query and transformation engines support is the ability to retrieve collections of instances of a particular type/kind regardless of their location in a model (i.e. OCL's allInstances()). In this paper we discuss existing strategies for computing such collections of instances and highlight their advantages and shortcomings. We then contribute a novel computation strategy that makes use of static analysis and metamodel introspection to pre-compute and cache all such collections needed in the context of a query in one pass. We present an implementation of the proposed strategy on top of an existing model query language (Epsilon's EOL [5]) and benchmark it against alternative computation strategies.

===2 Background and Motivation===
The majority of contemporary model query and transformation languages provide support for retrieving collections of all model elements that are instances of a particular type/kind. For example, OCL, QVTr, ATL, and Acceleo provide the built-in allInstances() operation, which can be invoked on a type to return a set containing all its instances (e.g. Person.allInstances()); Epsilon's EOL provides the getAllOfType() and getAllOfKind() operations; and QVTo provides the objects(type : Type) and objectsOfType(type : Type) operations, which operate in a similar way. We collectively refer to all such operations as allInstances() in the remainder of the paper.

For file-based EMF models, a naive strategy to implement allInstances() is to navigate the in-memory model element containment tree upon invocation, and collect and return all instances of the requested type. Repeatedly traversing the containment tree to fetch all instances of the same type across multiple invocations of the operation on that type is clearly inefficient, so the majority of model query and transformation engines provide support for caching and reusing the results of previous invocations of the operation (this is straightforward for side-effect-free languages but requires some additional book-keeping for languages that can mutate the state of a model).

When a query (or a transformation) contains a large number of calls to allInstances() for different types, instead of traversing the containment tree for each of these calls/types on demand, it can be more efficient for the execution engine to pre-compute and cache all these collections in one pass at start-up (greedy caching). This can incur a higher upfront cost and increase the memory footprint; however, for a sufficiently high number of invocations on different types, it is very likely to pay off eventually, particularly as models grow in size.
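To make the trade-off concrete, the sketch below contrasts the two strategies in plain Java against the EMF API. It is an illustrative reconstruction under our own names (AllInstancesStrategies and its two methods), not code from any of the engines discussed.

<pre>
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import org.eclipse.emf.ecore.EClass;
import org.eclipse.emf.ecore.EObject;
import org.eclipse.emf.ecore.resource.Resource;

public class AllInstancesStrategies {

    /** Lazy strategy: walk the whole containment tree on every call. */
    public static List<EObject> allOfKindLazy(Resource model, EClass type) {
        List<EObject> result = new ArrayList<>();
        for (Iterator<EObject> it = model.getAllContents(); it.hasNext();) {
            EObject o = it.next();
            if (type.isInstance(o)) { // matches direct and indirect instances
                result.add(o);
            }
        }
        return result;
    }

    /** Greedy strategy: a single pass populates a per-EClass cache up front. */
    public static Map<EClass, List<EObject>> precacheGreedily(Resource model) {
        Map<EClass, List<EObject>> cache = new HashMap<>();
        for (Iterator<EObject> it = model.getAllContents(); it.hasNext();) {
            EObject o = it.next();
            cache.computeIfAbsent(o.eClass(), c -> new ArrayList<>()).add(o);
        }
        return cache; // allOfKind(c) = union of the caches of c and its sub-classes
    }
}
</pre>

Greedy pre-caching pays the full traversal cost once at load time; the lazy variant defers that cost but repeats it for every type whose instances have not yet been cached.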
Overall, when more than one call to allInstances() is made for different types in the context of a query, the on-demand approach is sub-optimal in terms of performance. On the other hand, if a query only calls allInstances() on a small number of types (compared to the total number of types in the metamodel), greedy caching is wasteful.

===3 Program- and Metamodel-Aware Instance Collection===
Given in-advance knowledge of the metamodel of a model, and of the types on which allInstances() is likely to be invoked in the context of a query operating on that model (e.g. obtained through static analysis of the query itself), in this section we demonstrate how a query execution engine can efficiently pre-compute and cache the results of only these invocations by traversing the contents of the model only once.

We demonstrate the proposed algorithms and their supporting data structures with reference to a concrete OCL-like query language (Epsilon's EOL). For conciseness, we also restrict the discussion to EOL queries operating on a single EMF-based model which conforms to an Ecore metamodel comprising exactly one EPackage. However, the proposed approach is trivially portable to other query and transformation languages of a similar nature, and to queries that involve more than one model conforming to multi-EPackage metamodels.

====3.1 Cache Configuration Model====
Figure 1 demonstrates a data structure (in the form of a metamodel), an instance of which needs to be populated at compile-time (e.g. by statically analysing the query of interest and by introspecting the metamodel of the models on which it will be executed) in order to facilitate efficient execution of allInstances() at runtime.

Fig. 1. Cache Configuration Metamodel

CacheConfiguration acts as a container for the EClasses of the model's metamodel that the engine may need to retrieve all instances of in the context of the query of interest. EClasses of interest can be linked to a CacheConfiguration through the latter's allOfKind and allOfType references (EOL, like QVTo, supports distinct operations for computing all direct instances and all direct and indirect instances of a given type). We intentionally refrain from discussing the traverse reference in Figure 1 for now.
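As a rough plain-Java analogue, the cache configuration boils down to three sets of metamodel elements. The paper defines this structure as an Ecore metamodel (Figure 1); the class below and its field names are our own hypothetical rendering, reused by the later sketches.

<pre>
import java.util.HashSet;
import java.util.Set;

import org.eclipse.emf.ecore.EClass;
import org.eclipse.emf.ecore.EReference;

/** Plain-Java analogue of the Cache Configuration metamodel of Figure 1. */
public class CacheConfiguration {
    /** EClasses whose direct instances must be cached (allOfType()). */
    public final Set<EClass> allOfType = new HashSet<>();
    /** EClasses whose direct and indirect instances must be cached (allOfKind()). */
    public final Set<EClass> allOfKind = new HashSet<>();
    /** Containment references worth following at runtime (see Section 3.3). */
    public final Set<EReference> traverse = new HashSet<>();
}
</pre>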
====3.2 Query Static Analysis====
The first step of the process is to generate an initial version of the cache configuration model by statically analysing the query of interest. Figure 2 demonstrates the type-resolved abstract syntax graph of the example EOL program illustrated in Listing 1.1, which operates on models conforming to the metamodel of Figure 3. To compute the initial version of the cache configuration model we need to iterate through the abstract syntax graph and locate instances of:
* MethodCallExpression for which the name of the method called is allOfKind, allOfType or allInstances (an alias of allOfKind()), the resolved type of their target expression is ModelElementType, and which have no parameter values;
* PropertyCallExpression for which the name of the property is all (an alias of allOfKind()), and the resolved type of their target expression is ModelElementType.

Listing 1.1. An example EOL program
<pre>
WebPage.allOfType().println();
Member.allOfKind().println();
</pre>

Having identified the calls of interest, we construct a new CacheConfiguration and for each call to allOfType() we create an allOfType link to its respective EClass. Similarly, for all other calls of interest we link the respective EClasses to the cache configuration via its allOfKind reference. The initial extracted cache configuration model for our running example is illustrated in Figure 5.

Fig. 2. The Abstract Syntax Graph of the EOL program of Listing 1.1
Fig. 3. The University Metamodel
Fig. 4. The University Model
Fig. 5. Initial Extracted Cache Configuration Model

====3.3 Containment Reference Pruning====
Following the process discussed above, the execution engine can now be aware of all the allInstances() collections it needs to pre-compute and cache (WebPage.allOfType() and Member.allOfKind() in our running example). The next step is to collect the model elements of interest in one pass and as efficiently as possible.

A straightforward collection strategy would involve navigating the entire model containment tree, assessing whether each model element is of one of the types of interest and, if so, adding it to the appropriate cache(s). However, by inspecting the example model in Figure 4, we observe that traversing the containment closure of the modules reference of the "Computer Science" Department model element is guaranteed not to reveal any model elements of interest (according to the metamodel of Figure 3, modules can only contain lectures, and neither of these types of elements is of interest to the query).

This observation can be generalised and exploited to prune the subset of the containment tree that the engine will need to visit in order to populate the caches of interest. To achieve this we need to analyse the metamodel and compute the subset of containment references that can potentially lead to elements of interest. The proposed algorithm is illustrated in Algorithm 1. Please note that the algorithm has been simplified for presentation purposes, and that implementations of it need to make use of memoisation to avoid the infinite recursion that can be caused by circular containment references of no interest.

Algorithm 1: Containment Reference Selection Algorithm
<pre>
let cm = the initial version of the cache configuration model;
let p = the EPackage that the model conforms to;
let refs = empty list of EReferences;
foreach non-abstract EClass c in p do
    foreach containment EReference r of c do
        call shouldBeTraversed(r);
    end
end

function shouldBeTraversed(r : EReference) : Boolean
    let types = transitive closure of r's type and all its sub-types;
    if types includes any of the EClasses in cm then
        add r to refs;
        return true;
    else
        foreach containment EReference tr of each of the types do
            if shouldBeTraversed(tr) then
                return true;
            end
        end
        return false;
    end
end
</pre>

Adding the computed containment references that need to be traversed at runtime to the (incomplete) cache configuration model of Figure 5 produces the (complete) configuration model of Figure 6.

Fig. 6. Complete Cache Configuration Model
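A possible Java realisation of Algorithm 1 over the EMF reflective API is sketched below. It adds the memoisation the paper calls for, and (going slightly beyond the simplified listing) also records a reference when one of its nested containment references can lead to elements of interest, since the reference must then be traversed to reach them. Class and method names are ours.

<pre>
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

import org.eclipse.emf.ecore.EClass;
import org.eclipse.emf.ecore.EClassifier;
import org.eclipse.emf.ecore.EPackage;
import org.eclipse.emf.ecore.EReference;

/** A minimal sketch of Algorithm 1 (containment reference pruning). */
public class ContainmentReferencePruner {

    private final Set<EClass> classesOfInterest; // EClasses linked via allOfType/allOfKind
    private final List<EClass> allClasses = new ArrayList<>();
    private final Set<EReference> refs = new LinkedHashSet<>();    // selected references
    private final Map<EReference, Boolean> memo = new HashMap<>(); // breaks cycles

    public ContainmentReferencePruner(Set<EClass> classesOfInterest) {
        this.classesOfInterest = classesOfInterest;
    }

    public Set<EReference> select(EPackage p) {
        for (EClassifier c : p.getEClassifiers()) {
            if (c instanceof EClass) { allClasses.add((EClass) c); }
        }
        for (EClass c : allClasses) {
            if (!c.isAbstract()) {
                // getEAllContainments() also covers inherited containment references
                for (EReference r : c.getEAllContainments()) {
                    shouldBeTraversed(r);
                }
            }
        }
        return refs;
    }

    private boolean shouldBeTraversed(EReference r) {
        Boolean cached = memo.get(r);
        if (cached != null) { return cached; }
        memo.put(r, false); // provisional value guards against circular containments

        // types = r's declared type plus all of its sub-types in the package
        List<EClass> types = new ArrayList<>();
        for (EClass c : allClasses) {
            if (r.getEReferenceType().isSuperTypeOf(c)) { types.add(c); }
        }

        boolean traverse = false;
        for (EClass t : types) {
            if (classesOfInterest.contains(t)) { traverse = true; break; }
        }
        if (!traverse) {
            // r is still worth traversing if a nested containment reference of
            // one of the reachable types can lead to elements of interest
            outer:
            for (EClass t : types) {
                for (EReference tr : t.getEAllContainments()) {
                    if (shouldBeTraversed(tr)) { traverse = true; break outer; }
                }
            }
        }
        if (traverse) { refs.add(r); }
        memo.put(r, traverse);
        return traverse;
    }
}
</pre>

For the running example, a pruner seeded with the WebPage and Member EClasses would select departments, members and webpage, but not modules.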
====3.4 Instance Collection and Caching====
Having computed the cache configuration model, the final step involves traversing only the identified containment references of the in-memory model at runtime, in a top-down recursive manner, to collect and cache the elements of interest. For example, with reference to the example model of Figure 4, the instance collection process starts at the top-level :University element. The element's EClass is not linked to the cache configuration via one of its allOfType or allOfKind references, and as such the element is not cached. Navigating the university's departments reference reveals a :Department element, which also does not need to be cached. The process does not need to navigate the department's modules reference, as it is not linked to the cache configuration via the latter's traverse reference, and as such it proceeds with its members reference. Traversing the members reference reveals an instance of Student and an instance of Staff, both of which are cached in preparation for the Member.allOfKind() invocation. Similarly, the webpage reference of :Staff is traversed and reveals a :WebPage, which is also cached in preparation for the WebPage.allOfType() invocation.
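The walkthrough above can be realised roughly as follows. This sketch is our own (reusing the hypothetical CacheConfiguration from Section 3.1): it keys allOfType caches by exact EClass and allOfKind caches by the registered super-type, and recurses only through containment references marked for traversal.

<pre>
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.eclipse.emf.ecore.EClass;
import org.eclipse.emf.ecore.EObject;
import org.eclipse.emf.ecore.EReference;

/** Top-down collection pass over the pruned containment tree. */
public class InstanceCollector {

    private final CacheConfiguration config; // from the Section 3.1 sketch
    private final Map<EClass, List<EObject>> cache = new HashMap<>();

    public InstanceCollector(CacheConfiguration config) {
        this.config = config;
    }

    public Map<EClass, List<EObject>> collect(List<EObject> roots) {
        for (EObject root : roots) { visit(root); }
        return cache;
    }

    private void visit(EObject element) {
        EClass type = element.eClass();
        // cache for allOfType() if the element's exact class is of interest
        if (config.allOfType.contains(type)) {
            cache.computeIfAbsent(type, c -> new ArrayList<>()).add(element);
        }
        // cache for allOfKind() if any registered kind is a super-type
        for (EClass kind : config.allOfKind) {
            if (kind.isSuperTypeOf(type)) {
                cache.computeIfAbsent(kind, c -> new ArrayList<>()).add(element);
            }
        }
        // recurse only through containment references marked as traverse
        for (EReference r : type.getEAllContainments()) {
            if (config.traverse.contains(r)) {
                Object value = element.eGet(r);
                if (r.isMany()) {
                    for (Object child : (List<?>) value) { visit((EObject) child); }
                } else if (value != null) {
                    visit((EObject) value);
                }
            }
        }
    }
}
</pre>

For the running example, collect() would be invoked with the :University root; only the departments, members and webpage references would be followed, yielding exactly the WebPage and Member caches.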
===4 Evaluation===
In this section we report on the results of benchmarks performed on four different strategies for computing allInstances():
# Lazy (on-demand) computation (L)
# Greedy pre-caching (G)¹
# Type-aware pre-caching (T)²
# Type-and-reference-aware pre-caching (TR)

¹ As discussed in Section 2, this approach naively pre-computes all possible allOfType and allOfKind caches.
² This approach makes use of static analysis as discussed in Section 3.2 but does not prune containment references, and as such it needs to visit the entire containment tree at runtime. It is included in this benchmark only to assess the additional benefits of containment reference pruning.

Benchmarks were performed on a computer with an Intel(R) Core(TM) i7 CPU @ 2.3GHz and 8GB of physical memory, running OS X Yosemite. The version of the Java Virtual Machine used was 1.8.0_31-b13. Results are in seconds. For our benchmarks, models of varying sizes obtained from reverse-engineered Java code in the 2009 GraBaTs contest³ are used. These models, named set0, set1, set2, set3 and set4 (9.2MB, 27.9MB, 283.2MB, 626.7MB and 676.9MB respectively), are stored in XMI 2.0 format and have been used in various benchmarks of different tools [6, 7].

³ GraBaTs 2009: 5th Int. Workshop on Graph-Based Tools, http://is.tm.tue.nl/staff/pvgorp/events/grabats2009/

====4.1 Model Element Coverage====
To quantify model coverage in our benchmarks, we counted the number of elements in each data set and then automatically generated EOL programs which exercise 20%, 40%, 60%, 80% and 100% of the total elements for each data set. An example generated EOL program is provided in Listing 1.2. We then executed all the generated EOL programs and measured performance in terms of the time taken to load the models with the four different strategies and the time taken to execute the programs.

Listing 1.2. An example of a generated EOL program for model element coverage
<pre>
var size = 0;
var methodInvocation = MethodInvocation.all.first();
size = size + MethodInvocation.all.size();
var qualifiedName = QualifiedName.all.first();
size = size + QualifiedName.all.size();
...
size.println();
</pre>

====4.2 Results====
The obtained results are presented in Table 1. The initials L, G, T and TR represent the aforementioned approaches (Lazy, Greedy, Type-Aware and Type-and-Reference-Aware). Since the execution time of the EOL programs for G, T and TR is practically the same⁴, we only present one result for all three of them under the * columns. Imp. represents the performance improvement of a given approach, Load represents the time it takes to load the models, Exec. represents the time it takes to execute the EOL programs, and Total represents the time it takes to load the model and execute an EOL program for a single experiment.

⁴ This is expected, as all three strategies populate all required caches before the EOL program executes.

Table 1. Benchmark results for Lazy, Greedy, Type-Aware and Type-and-Reference-Aware caching (* represents the results for G, T and TR collectively).
{| class="wikitable"
! Set !! Perc. !! L Load (s) !! L Exec. (s) !! G Load (s) !! T Load (s) !! TR Load (s) !! * Exec. (s) !! Imp.G Total !! Imp.T Total !! Imp.TR Total !! Imp.* Exec.
|-
| Set0 || 20% || 0.552 || 0.015 || 0.652 || 0.554 || 0.572 || 0.001 || -15.17% || 2.12% || -1.06% || 93.33%
|-
| Set0 || 40% || 0.555 || 0.007 || 0.631 || 0.572 || 0.561 || 0.002 || -12.63% || -2.14% || -0.18% || 71.43%
|-
| Set0 || 60% || 0.549 || 0.012 || 0.645 || 0.571 || 0.573 || 0.003 || -15.51% || -2.32% || -2.67% || 75.00%
|-
| Set0 || 80% || 0.543 || 0.026 || 0.652 || 0.573 || 0.576 || 0.005 || -15.47% || -1.58% || -2.11% || 80.77%
|-
| Set0 || 100% || 0.552 || 0.141 || 0.638 || 0.623 || 0.619 || 0.013 || 6.06% || 8.23% || 8.80% || 90.78%
|-
| Set1 || 20% || 1.643 || 0.606 || 1.856 || 1.653 || 1.672 || 0.01 || 17.03% || 26.06% || 25.21% || 98.35%
|-
| Set1 || 40% || 1.596 || 0.595 || 1.875 || 1.736 || 1.711 || 0.011 || 13.92% || 20.26% || 21.41% || 98.15%
|-
| Set1 || 60% || 1.587 || 0.556 || 1.843 || 1.786 || 1.773 || 0.013 || 13.39% || 16.05% || 16.66% || 97.66%
|-
| Set1 || 80% || 1.611 || 0.571 || 1.86 || 1.787 || 1.788 || 0.017 || 13.98% || 17.32% || 17.28% || 97.02%
|-
| Set1 || 100% || 1.606 || 0.626 || 1.866 || 1.852 || 1.852 || 0.021 || 15.46% || 16.08% || 16.08% || 96.65%
|-
| Set2 || 20% || 14.159 || 2.244 || 17.169 || 14.802 || 14.809 || 0.007 || -4.71% || 9.72% || 9.68% || 99.69%
|-
| Set2 || 40% || 14.061 || 4.402 || 17.979 || 16.587 || 16.613 || 0.015 || 2.54% || 10.08% || 9.94% || 99.66%
|-
| Set2 || 60% || 14.456 || 3.305 || 16.96 || 16.276 || 15.851 || 0.02 || 4.40% || 8.25% || 10.64% || 99.39%
|-
| Set2 || 80% || 15.151 || 5.685 || 18.145 || 17.724 || 18.217 || 0.03 || 12.77% || 14.79% || 12.43% || 99.47%
|-
| Set2 || 100% || 15.223 || 6.2 || 17.32 || 17.769 || 17.839 || 0.036 || 18.98% || 16.89% || 16.56% || 99.42%
|-
| Set3 || 20% || 34.199 || 8.706 || 38.096 || 34.17 || 33.753 || 0.017 || 11.17% || 20.32% || 21.29% || 99.80%
|-
| Set3 || 40% || 31.786 || 9.756 || 37.552 || 35.086 || 34.809 || 0.028 || 9.54% || 15.47% || 16.14% || 99.71%
|-
| Set3 || 60% || 31.835 || 12.222 || 37.528 || 36.516 || 35.662 || 0.045 || 14.72% || 17.01% || 18.95% || 99.63%
|-
| Set3 || 80% || 32.417 || 11.456 || 39.301 || 39.302 || 37.795 || 0.068 || 10.27% || 10.26% || 13.70% || 99.41%
|-
| Set3 || 100% || 35.872 || 13.7 || 38.659 || 40.779 || 40.513 || 0.071 || 21.87% || 17.59% || 18.13% || 99.48%
|-
| Set4 || 20% || 36.133 || 7.586 || 43.745 || 39.477 || 37.278 || 0.018 || -0.10% || 9.66% || 14.69% || 99.76%
|-
| Set4 || 40% || 37.99 || 12.973 || 43.515 || 41.044 || 41.01 || 0.039 || 14.54% || 19.39% || 19.45% || 99.70%
|-
| Set4 || 60% || 36.457 || 14.131 || 44.883 || 42.348 || 41.055 || 0.05 || 11.18% || 16.19% || 18.75% || 99.65%
|-
| Set4 || 80% || 37.782 || 11.762 || 41.932 || 44.038 || 45.168 || 0.065 || 15.23% || 10.98% || 8.70% || 99.45%
|-
| Set4 || 100% || 37.617 || 14.563 || 44.813 || 46.914 || 43.406 || 0.078 || 13.97% || 9.94% || 16.67% || 99.46%
|}

From the benchmarks we observe that with the Greedy, Type-Aware and Type-and-Reference-Aware approaches, programs execute significantly faster than with the Lazy approach. These approaches require more time to load the models due to the overhead incurred by their respective caching logic; this overhead penalises overall performance for small data sets (set0 in this case). However, as models grow larger, these approaches pay off in terms of the combined time it takes to load a model and execute an EOL program (total time). In general, TR provides the best performance, but in some cases where TR needs to visit elements deep in the containment tree, T and G marginally outperform it. In terms of memory footprint, the three approaches behave very similarly and incur a small linear overhead compared to L.

===5 Related Work===
Several database-backed model persistence prototypes have been proposed for persisting and loading large models, including Morsa [8], Neo4EMF [7], MongoEMF [9], EMF Fragments [10] and Hawk [6]. The general idea behind these prototypes is that they are able to load only the parts of a model that are needed for the task at hand (e.g. to compute particular queries), so that large models can be accessed efficiently both in terms of loading time and memory consumption. Computing allInstances() in such systems typically does not require traversing the entire model and can be achieved through efficient internal queries expressed in the underpinning database's native query language (e.g. SQL, Cypher). Despite the clear technical advantages of database-backed technologies, there are still valid reasons for using file-based formats (e.g. XMI) for model persistence in certain contexts, such as standards compliance, tool interoperability, and compatibility with existing file-based version control systems such as Git and Subversion.
===6 Conclusion and Future Work===
In this paper we have proposed a novel approach for the computation and caching of allInstances()-like operations (e.g. as used in declarative model transformation rules) on in-memory EMF models. We have compared the proposed approach against three alternative approaches via extensive benchmarking and demonstrated the benefits it delivers in terms of aggregate model loading and query execution time. Such an approach brings benefits only to model management programs which trigger multiple calls to allInstances(). In future iterations of this work, we wish to investigate how static analysis and metamodel introspection can be used to further improve the performance of computationally expensive queries at runtime (e.g. by constructing and maintaining in-memory indexes that can improve the performance of collection filtering operations applied to the results of allInstances()).

===References===
# Parastoo Mohagheghi, Miguel A. Fernandez, Juan A. Martell, Mathias Fritzsche, and Wasif Gilani. MDE adoption in industry: challenges and success criteria. In Models in Software Engineering, pages 54–59. Springer, 2009.
# Paul Baker, Shiou Loh, and Frank Weil. Model-Driven Engineering in a Large Industrial Context. In Model Driven Engineering Languages and Systems, pages 476–491. Springer, 2005.
# Dimitrios S. Kolovos, Richard F. Paige, and Fiona A.C. Polack. Scalability: The holy grail of model driven engineering. In ChaMDE 2008 Workshop Proceedings: International Workshop on Challenges in Model-Driven Software Engineering, pages 10–14, 2008.
# Marcel van Amstel, Steven Bosems, Ivan Kurtev, and Luís Ferreira Pires. Performance in model transformations: experiments with ATL and QVT. In Theory and Practice of Model Transformations, pages 198–212. Springer, 2011.
# Dimitrios S. Kolovos, Richard F. Paige, and Fiona A.C. Polack. The Epsilon Object Language (EOL). In Model Driven Architecture – Foundations and Applications, pages 128–142. Springer, 2006.
# Konstantinos Barmpis and Dimitris Kolovos. Hawk: Towards a scalable model indexing architecture. In Proceedings of the Workshop on Scalability in Model Driven Engineering, page 6. ACM, 2013.
# Amine Benelallam, Abel Gómez, Gerson Sunyé, Massimo Tisi, and David Launay. Neo4EMF, a scalable persistence layer for EMF models. In Modelling Foundations and Applications, pages 230–241. Springer, 2014.
# Javier Espinazo Pagán, Jesús Sánchez Cuadrado, and Jesús García Molina. Morsa: A scalable approach for persisting and accessing large models. In Model Driven Engineering Languages and Systems, pages 77–92. Springer, 2011.
# Bryan Hunt. MongoEMF, 2014. https://github.com/BryanHunt/mongo-emf/wiki
# Markus Scheidgen. Reference representation techniques for large models. In Proceedings of the Workshop on Scalability in Model Driven Engineering, page 5. ACM, 2013.