<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Parallel execution of first-order operations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sina Madani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dimitrios S. Kolovos</string-name>
          <email>dimitris.kolovos@york.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Richard F. Paige</string-name>
          <email>richard.paige@york.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of York</institution>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The growing size of software models poses significant scalability challenges. Amongst these scalability issues is the execution time of queries and transformations. Although the processing pipeline for models may involve numerous stages such as validation, transformation and code generation, many of these complex processes are (or can be) expressed by a combination of simpler and more fundamental operations. In many cases, these underlying operations are pure functions, making them amenable to parallelisation. We present parallel execution algorithms for a range of iteration-based operations in the context of the OCL-inspired Epsilon Object Language. Our experiments show a significant improvement in the performance of queries on large models.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Modern software systems are often required to process an ever-increasing volume
of complex data with more stringent throughput and latency requirements.
Although model-driven engineering helps to curtail the complexity of systems, the
performance of many popular modelling tools leaves much to be desired. This
is particularly problematic since larger projects are arguably likely to benefit
the most from a model-driven approach. Scalability is a notable challenge with
model-driven engineering [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], and a multi-faceted one too [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>Model management workflows often involve a variety of tasks such as
validation, comparison, model-to-model and model-to-text transformations. Despite
their differences, these tasks typically use a common set of queries and
transformations on collections of model elements. With very large models (on the
order of millions of elements and gigabytes in size), these operations can incur a
significant performance cost, making the process slow and less productive.
Furthermore, even the most complex queries on collections of data (model elements)
can be expressed using a relatively small set of simpler operations. The Object
Constraint Language (OCL) is one of the most well-known and frequently used
languages for querying and validating models. As a functional and declarative
language, OCL offers a useful set of operations on collections, including
operations involving predicate logic. In these first-order operations, a function is
applied to some or all of the elements.</p>
      <p>In this paper, we demonstrate that nearly all first-order operations can be
executed in a data-parallel manner. We illustrate this with a broad set of
operations, outlining the challenges with implementing parallel algorithms for these
operations and their respective solutions. Given the diminishing generational
improvements in single-thread performance due to physical and technical
constraints, combined with the increasing number of cores in virtually all computing
devices, we posit that sequential algorithms for inherently parallelisable
computations are not optimal when processing large data sets on modern computers.</p>
      <p>The remainder of the paper is organised as follows. Section 2 reviews
pertinent work on optimising iteration operations. Section 3 introduces Epsilon,
which we use to implement and evaluate our solutions, as well as discussing
some general pre-requisites for concurrent execution. Section 4 reviews twelve
iteration-based operations and describes how they can be re-implemented with
a parallel execution algorithm. Section 5 provides a brief overview of our
testing methodology and performance metrics. Section 6 concludes the paper and
suggests extensions for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Background and related work</title>
      <p>Most first-order operations on collections have some desirable properties which
enable a number of optimisations to be made in their execution algorithms.
All first-order operations take a lambda expression as a parameter, which is
typically a predicate. This function takes as input an element from the source
collection and evaluates an expression on it (typically yielding a Boolean value in
the case of predicates). Regardless of the transformation type, one of the most
crucial properties of first-order operations is that their transformation functions
can operate on each element independently. That is, the lambdas are usually
pure functions, and so in languages like OCL they have no effect on the global
state of the program. The motivation for parallel execution thus stems from the
observation that in most cases, the same lambda expression can be executed on
each element of the source collection in any order, with no dependencies between
invocations and, therefore, no inherent requirement for serial execution.</p>
      <p>
        Parallel execution of the same instructions over multiple data (SIMD) is a
form of data-parallelism which is common in traditional "pleasingly parallel"
tasks such as video rendering. This class of problems can often map closely to
the specialised hardware architectures of graphics processing units.
Parallel streams Data parallelism can also be harnessed by general-purpose
applications on conventional processors. The Java standard library provides
streams [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which provide an abstract processing pipeline over a fixed or infinite
data stream. Streams are iterator-based, and so they can be used on
collections. The Streams API in Java provides first-order operations such as filtering
and transformation in a generic and flexible manner, allowing the programmer
to chain operations and perform post-processing or collect the results into any
desired data structure. Streams can also execute in parallel, though this may
require relaxing guarantees on the encounter order of the output.
      </p>
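      <p>As an illustration of the Streams API described above, the following is a minimal sketch (not taken from any implementation discussed in this paper; the class and method names are ours) of a select-like filter expressed as a stream pipeline, where the same pure predicate can be evaluated sequentially or in parallel:</p>
      <preformat><![CDATA[
```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ParallelStreamDemo {
    // A select-like filter as a Java stream pipeline. Because the predicate
    // is a pure function, the elements can be processed in any order.
    static List<Integer> evens() {
        return IntStream.rangeClosed(0, 10)
            .parallel()                      // switch to parallel execution
            .filter(i -> i % 2 == 0)         // predicate evaluated per element
            .boxed()
            .collect(Collectors.toList());   // collect preserves encounter order
    }

    public static void main(String[] args) {
        System.out.println(evens());
    }
}
```
]]></preformat>
      <p>Note that even in the parallel case, the terminal collect operation restores the encounter order of the source.</p>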
      <p>Parallel streams internally use a divide-and-conquer approach, delegating
to Java's fork-join processing framework. The key to making this possible is the
ability to split the data source (via a java.util.Spliterator), and perhaps more
fundamentally, the assumption that none of the operations have side-effects or
rely on mutable global state.</p>
      <p>
        Furthermore, the iterator-based nature of streams means that the entire
operation chain can be evaluated on individual elements, enabling lazy evaluation
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. That is, instead of requiring the intermediate results of e.g. a filter (select
in OCL) operation to be pooled into a collection only to be filtered again, the
operations themselves can be fused to provide short-circuiting behaviour (see [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] for an example).
Lazy evaluation On the topic of lazy evaluation of expressions on
collections, Tisi et al. (2015) also propose an iterator-based approach in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The
basic premise of their work is to treat OCL collections in a similar manner to
Java Streams, such that operations on collections are evaluated only when
required by a subsequent computation. This is achieved by returning an iterator
which evaluates the desired expression on each element as it is iterated over,
possibly within a chain of other operations. The authors also apply this lazy
approach to the allInstances() operation, which retrieves all elements of the
target type. Since this is often the source collection on which further operations are
invoked, it enables a lazy evaluation strategy to be applied in the entire chain
of computation rather than only the intermediate operations.
      </p>
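      <p>The fusion and short-circuiting behaviour described above can be sketched as follows (an illustrative example of ours, not code from the cited work); the counter shows that elements after the first match are never evaluated:</p>
      <preformat><![CDATA[
```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;

public class LazyEvalDemo {
    static final AtomicInteger evaluations = new AtomicInteger();

    // The map and filter stages are fused into a single pass per element, and
    // findFirst() short-circuits: once a match is found, the remaining
    // elements of the source are never evaluated at all.
    static Optional<Integer> firstSquareAbove10(List<Integer> source) {
        return source.stream()
            .peek(i -> evaluations.incrementAndGet()) // count evaluations
            .map(i -> i * i)
            .filter(sq -> sq > 10)
            .findFirst();
    }

    public static void main(String[] args) {
        System.out.println(firstSquareAbove10(List.of(1, 2, 3, 4, 5, 6)));
        System.out.println(evaluations.get()); // fewer than 6 evaluations
    }
}
```
]]></preformat>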
      <p>
        Formal parallelism In [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], Vajk et al. (2011) take a more formal approach
to parallel execution of OCL expressions. Using the well-established
Communicating Sequential Processes (CSP) model of concurrency, the authors provide a
mapping between OCL expressions and CSP processes, focusing on binary
expressions for task parallelism and iterators for data parallelism. The CSP is then
compiled to C# code. However, they do not provide a complete library of parallel
operations, instead relying on the most general first-order operation, iterate,
to implement and evaluate their approach.
      </p>
      <p>Our contribution does not attempt to prove the correctness of a parallel
approach; rather, it highlights and demonstrates solutions for the
practical technical challenges in hand-coding parallel algorithms for iteration
operations on collections in OCL-based languages.</p>
    </sec>
    <sec id="sec-3">
      <title>Epsilon Object Language (EOL)</title>
      <p>
        Epsilon [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] is an open-source family of task-specific model management languages
which build upon an imperative, dynamically typed and interpreted
"model-oriented" language. This base language, the Epsilon Object Language (EOL)
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], provides many of the familiar programming constructs and syntax of Java,
such as loops, if/else statements, mutable variables and user-defined operations,
which may be impure functions. EOL is inspired by OCL in its design, and thus
provides almost all of the functionality of OCL, with similar syntax and names
for types and built-in operations. Invariants can be expressed in the more specific
Epsilon Validation Language [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>Where the Epsilon Object Language differs from OCL is in its more
expressive features, allowing for a mix of imperative and declarative styles of
programming. Since Epsilon is interpreted and written in Java, one of its distinguishing
features is native types, which enable the creation of native Java objects
and the invocation of methods on them. Another notable feature is extended
properties, which are additional key-value pairs that can be used to decorate
individual variables or model element types.</p>
      <sec id="sec-3-1">
        <title>Epsilon thread safety</title>
        <p>Before considering the parallelisation of first-order operations, it is important to note
and address some general concerns with concurrent execution of an interpreted
language like EOL. Although first-order operations are usually free from
side-effects, they may depend on global state in the program and on operations, which
in EOL could potentially have side-effects. For instance, a first-order operation's
predicate may depend on some threshold value which is set globally or as a
parameter to the program. The programmer may also find it useful to factor
out common functionality of their lambda expressions into an operation, to avoid
repetition and improve readability. Another common use case in complex queries
is the nesting of first-order operations. All of these cases should be supported
when executing concurrently, in a manner transparent to the user.</p>
        <p>
          Building on our previous work in [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], we identified and solved the main
challenges with concurrency in EOL. These can be summarised as follows:
Variable declarations Concurrent execution requires a clear policy on
variable scoping and visibility between threads. The policy for Epsilon is that any
variables declared prior to concurrent execution are accessible for both reads and
writes, with writes being strongly discouraged. Variables declared during
concurrent execution are localised to the executing job or thread, with no visibility
or access once the job has completed. This is achieved by using a thread-local
data structure and making the main thread's frame stack visible to all threads.
Exception handling The EOL engine keeps track of the execution trace of
the program so that in the event of an error, such as navigation of a null property
or a reference to a non-existent variable or operation, the location in the program
which caused the error can be reported to the user through a (Java-like) stack
trace. With multiple threads of execution, the engine needs to keep track of
the trace for each thread, terminate execution when an exception occurs and
report the cause in the main thread. We solve the former issue by having a
stack trace manager for each thread. Terminating execution upon encountering
an exception requires us to create an additional thread which waits for
concurrent execution to complete, and to have the main thread wait on a condition.
When concurrent execution has completed or an exception is raised, the main
thread is notified and appropriate action is taken. In the case of exceptions, the
Epsilon stack trace is computed and cached, all concurrent jobs are cancelled,
the executor service is terminated and finally the exception is reported.
Caches Various shared mutable data structures exist in the EOL engine,
some of which are lazily initialised to save memory. For example, invocations
of allOfKind and allOfType on model element types populate a cache to avoid
repeatedly querying the model. Features such as cached operations and extended
properties also pose concurrency complications. All of these caches are shared
between threads, and are made thread-safe using a high-performance data structure
which only synchronizes for writes (i.e. ConcurrentHashMap).
        </p>
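        <p>The lazily-initialised, write-synchronized cache described above can be sketched as follows (an illustrative example of ours; the class and method names are hypothetical, not Epsilon's actual API):</p>
        <preformat><![CDATA[
```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class TypeCacheDemo {
    // A lazily-populated shared cache in the style described above:
    // ConcurrentHashMap only synchronizes on writes, and computeIfAbsent
    // ensures the expensive lookup runs at most once per key, even when
    // many threads request the same type concurrently.
    private final Map<String, List<String>> allOfTypeCache = new ConcurrentHashMap<>();

    List<String> allOfType(String type) {
        return allOfTypeCache.computeIfAbsent(type, t -> queryModel(t));
    }

    // Stand-in for an expensive model query (hypothetical).
    private List<String> queryModel(String type) {
        return List.of(type + "#1", type + "#2");
    }
}
```
]]></preformat>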
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Parallel implementation of first-order operations</title>
      <p>In this section, we explore the declarative operations on Collection types in
Epsilon and demonstrate our solutions for their parallel variants. As previously
noted, since Epsilon is arguably more feature-rich and expressive than OCL,
the solutions outlined in this section should be generalisable to other
OCL-like languages. The more general challenges with concurrency discussed in the
previous section are effectively encapsulated in the engine's execution context,
allowing us to implement the first-order operations in parallel without needing
to consider such issues in every case.</p>
      <p>Most first-order operations in Epsilon are "select-based", as will become
apparent in the subsequent subsections. In general, there are two types of operation
execution: those which execute the given lambda expression on all elements, and
those which execute it until a certain condition is met. The latter applies to
operations where the lambda expression is a predicate. Table 1
shows the first-order operations and their properties. The type T refers to the
element type of the source collection, whilst other letters refer to derived types.</p>
      <p>Although not all of the operations in Table 1 exist in OCL, we chose to
implement them in order to explore in depth the challenges of converting from
sequential to parallel execution algorithms.
</p>
      <p>select and reject
The select operation is a filter on the collection, returning the subset for which
the given predicate is true. Since this operation's processing logic can be
reused, there are two additional internal parameters. The first of these is whether
the operation should return when a match is found (i.e. the predicate evaluates
to true), and the second is whether the operation is a reject or a select. With
these parameters, the first four operations in Table 1 can be handled by a single
code base in the sequential implementation. reject can also be expressed using
select, since c.reject(i | &lt;p&gt;) === c.select(i | not &lt;p&gt;) for a predicate p. In the
sequential implementation, it is also relatively trivial to implement selectOne and
rejectOne in the same code as select and reject. selectOne ("any" in OCL) returns
the first value which satisfies the predicate, whilst rejectOne adds all of the
source collection to the results and removes the first value which does not satisfy
the predicate. The implementation becomes more complex when executing in
parallel, so the same approach and degree of re-use is not possible.
</p>
      <p>parallelSelect and parallelSelectOrdered
As with all other first-order operations, we can execute select in parallel by
submitting a new job to an executor service such as a thread pool (which takes
care of managing thread lifecycles and mapping jobs to threads) for each
element in the collection, so that the predicate is evaluated for each element
independently. In the sequential implementation, every time a matching value
is found it is added to the results collection. When executing in parallel,
however, this would require synchronization on the results collection for every write,
which would greatly diminish the performance gains from parallelism. There is
also the matter of ordering the results in a manner consistent with the sequential
select. For this reason, we offer two implementations: parallelSelect when order
is immaterial, and parallelSelectOrdered when ordering is desired.</p>
      <p>
        If encounter order is to be preserved, one solution is to use Futures [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The
idea is that when each job is submitted to the executor service, an object which
encapsulates the result of the job's computation is returned, and can later be
used to retrieve the result. We therefore add all of the futures to a collection
and subsequently loop through this collection, getting the result for each job in
the order that the jobs were submitted (ordering is guaranteed since we add the
futures and request their results sequentially). Although getting the result from a Future
is blocking, since select requires us to operate on every element of the source
collection, all results must be obtained before proceeding.
      </p>
      <p>Collecting the results is also not as straightforward as in the sequential case.
Recall that a Future returns the result of an asynchronous computation. In the
case of parallelSelectOrdered, this result is not always present. Although the type
of the result is determined (the same as that of the source Collection's elements),
we also need a mechanism to signal the absence of a result for values for which
the predicate is false. Note that this cannot simply be null, since an item for
which the predicate is true may itself be null. We therefore wrap the result of
the computation in a java.util.Optional (or any other arbitrary container, such as a
singleton collection). This way, if the Optional reference itself is null, we know there was no
matching value; if a match is present, it is represented by an Optional (with a null
value represented by an empty Optional). Once all jobs have completed and
their values obtained, we can add the contents of all non-null Optionals to the
results collection sequentially, eliminating the need for a thread-safe collection.</p>
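      <p>The Futures-based mechanism described above can be sketched as follows (a minimal illustration of ours, not the actual Epsilon source; class and method names are hypothetical):</p>
      <preformat><![CDATA[
```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

public class OrderedParallelSelect {
    // Sketch of parallelSelectOrdered: one job per element, each returning
    // null when the predicate fails, or an Optional wrapping the element
    // (possibly empty, if the element itself is null). Gathering the futures
    // in submission order preserves the encounter order of the source.
    static <T> List<T> select(List<T> source, Predicate<T> predicate) {
        ExecutorService executor = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        List<Future<Optional<T>>> futures = new ArrayList<>(source.size());
        for (T element : source) {
            futures.add(executor.submit(() ->
                predicate.test(element) ? Optional.ofNullable(element) : null));
        }
        List<T> results = new ArrayList<>();
        try {
            for (Future<Optional<T>> future : futures) {
                Optional<T> result = future.get(); // blocks until job completes
                if (result != null) {
                    results.add(result.orElse(null)); // unwrap, null-friendly
                }
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            executor.shutdown();
        }
        return results;
    }
}
```
]]></preformat>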
      <p>Although the solution with Futures guarantees ordering, it does not offer
the best performance, due to the overhead of creating additional wrapper
objects. For improved throughput, an alternative for gathering results is
to use persistent thread-local collections (essentially, a map from threads to their
respective collections of values) and merge the results at the end. We use this
implementation in parallelSelect. Another option would be to annotate objects with
their encounter order and restore this order in the returned collection; however,
this would incur additional sorting and object creation overhead.
</p>
      <p>parallelReject and parallelRejectOrdered
Since reject is effectively the same operation as select with negation of the
predicate, we can apply the same approach for parallelSelect (and
parallelSelectOrdered) as we did in the sequential variant. In both cases, the predicate
needs to be evaluated on the entire source collection and a collection is returned.
</p>
      <p>parallelSelectOne
The selectOne operation differs from select in that the return type is a single
value, rather than a collection of values. This means that the operation may
return without evaluating the predicate on the entire source collection, once
a matching value is found. In the case of parallelSelectOne, we cannot simply
modify the parallelSelect logic to achieve this behaviour, as we could in the sequential
implementation. Since we only need to find a single value, there is no need to use
Futures and wait for all computations to complete. Instead, we need a mechanism
which allows us to notify the main thread that a result has been found, pass this
result along and stop all other jobs, since they are no longer required once a matching
value has been found. Furthermore, we also need to ensure that if no matching
value can be found, the main thread does not wait indefinitely for a result.</p>
      <p>To achieve this, we use an "execution status" object which encapsulates multiple
lock conditions, allowing us to wait on either of them. There are two possible
ways in which parallel execution of submitted jobs can complete: exceptionally
or successfully. So far, we have assumed that successful completion is signalled
by the completion of all jobs. However, for the case of parallelSelectOne, we add
another condition to our execution status so that when a result is available, the
job which found it can invoke a method on the execution status object with the
value, which interrupts the main thread in a similar manner to an
exceptional completion. Upon interruption, all running jobs are terminated
and the value set on the execution status object is returned.</p>
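      <p>The coordination described above can be sketched as follows, with a CompletableFuture standing in for the execution status object (an assumption of ours for illustration; this is not the actual Epsilon implementation):</p>
      <preformat><![CDATA[
```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

public class ParallelSelectOneDemo {
    // Sketch of the "execution status" coordination: the first job to find a
    // match completes the future and wakes the main thread; when all jobs
    // finish without a match, the future is completed with null, so the
    // caller never waits indefinitely.
    static <T> T selectOne(List<T> source, Predicate<T> predicate) {
        if (source.isEmpty()) return null;
        ExecutorService executor = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        CompletableFuture<T> status = new CompletableFuture<>();
        AtomicInteger remaining = new AtomicInteger(source.size());
        for (T element : source) {
            executor.submit(() -> {
                if (predicate.test(element)) {
                    status.complete(element);      // signal a result
                }
                if (remaining.decrementAndGet() == 0) {
                    status.complete(null);         // all jobs done, no match
                }
            });
        }
        T result = status.join();                  // wait for either signal
        executor.shutdownNow();                    // cancel outstanding jobs
        return result;
    }
}
```
]]></preformat>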
      <p>This operation introduces an inconsistency with the sequential selectOne in
that the returned value may differ between invocations, whereas
the sequential variant will always return the first value which matches the
predicate, as returned by the source collection's iterator. This is because the
computations occur in parallel and it is non-deterministic which will finish first.
However, as the name of the OCL operation suggests ("any"), the chosen value
need not necessarily be the first. The parallel variant is thus not suitable in cases
where the user desires a "selectFirst" operation.
</p>
      <p>parallelRejectOne
The rejectOne operation is a hybrid of selectOne and select in that it does not
need to evaluate the predicate on the entire source collection, but it returns
a collection instead of a single value. Consequently, parallelRejectOne is not
a simple modification of parallelSelect or parallelSelectOne. However, since we
only need to exclude a single value from the source collection which matches
the predicate, we can delegate the predicate to a parallelSelectOne operation
and remove the returned value from the source collection. We still have
the issue of determining whether a result was found or not, so that we do not
inadvertently remove a null value from the source. This can be circumvented by
using a simple Boolean flag on the parallelSelectOne operation instance, which
is set when a result is found. Since this value will only ever be set from false
to true and never queried during parallel execution, there are no concurrency
issues to resolve. As with parallelSelectOne, this operation is inconsistent with
its sequential variant in that the returned collection may exclude an element
which is not the first value for which the predicate is not satisfied.
</p>
      <p>parallelExists
The exists operation returns true if there is at least one element in the source
collection which satisfies the predicate (essentially a logical OR of the predicate
over the elements). It is therefore shorthand for whether a selectOne operation
with that predicate has a result. Since we added a Boolean flag to
parallelSelectOne, we can simply delegate to that operation and return the value of the flag.
</p>
      <p>parallelNMatch
The nMatch operation requires that exactly n elements in the source collection
satisfy the predicate. Commonly, this value is either zero (the none operation) or
one (the one operation). In all cases, the predicate need not be evaluated on
all elements if either n+1 satisfying elements are found,
or if too few satisfying elements remain possible given the relative size of the
source collection and n. This requires a counter which is incremented every
time a match is found, as well as the number of elements evaluated (the loop
index). As the former is a shared mutable counter, we use an AtomicInteger,
which enables us to atomically increment and retrieve the value without
explicit synchronization. Although the latter counter is incremented in each
iteration, it is effectively an immutable input parameter to each parallel job
(much like the current element), because we loop through the source collection
and submit jobs sequentially. The coordination mechanism for short-circuiting
is similar to that of parallelSelectOne, except that we do not need to set a
value as the result in the execution status.
</p>
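      <p>The shared match counter described above can be sketched as follows (a minimal illustration of ours; for brevity it counts all matches rather than short-circuiting at n+1):</p>
      <preformat><![CDATA[
```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

public class ParallelNMatch {
    // Sketch of the shared match counter: jobs running in parallel bump an
    // AtomicInteger with no explicit locking. Once more than n matches have
    // been observed, the result can no longer be exactly n, which is where a
    // short-circuiting implementation would cancel the remaining jobs.
    static <T> boolean nMatch(List<T> source, Predicate<T> predicate, int n) {
        AtomicInteger matches = new AtomicInteger();
        source.parallelStream().forEach(element -> {
            if (predicate.test(element)) {
                matches.incrementAndGet(); // atomic, thread-safe increment
            }
        });
        return matches.get() == n;
    }
}
```
]]></preformat>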
      <p>parallelForAll
The forAll operation is similar to exists, except that it requires the predicate
to hold for all elements in the collection (i.e. logical AND instead of OR). One
possible implementation would be to delegate to nMatch, with n being the size
of the source collection. Alternatively, we can re-use parallelSelectOne by adding
a new flag which inverts the predicate, similar to rejectOne. If no element satisfies
the negated predicate, we effectively achieve the same semantics as the sequential
variant. In other words, we are looking for a counter-example to the predicate,
and in doing so we can return without evaluating the predicate for all elements.
</p>
      <p>parallelCollect and parallelCollectOrdered
The collect operation maps every element of the source collection to another
value, potentially of a different type. We can therefore perform this mapping
for every element in parallel. As with parallelSelect, we offer two modes: ordered
and unordered. It is worth noting that in the ordered case, we do not need a
wrapper such as Optional, since every element will be mapped.
</p>
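      <p>An ordered parallel collect can be sketched as follows (an illustrative example of ours using the Java Streams API, not the Epsilon source):</p>
      <preformat><![CDATA[
```java
import java.util.List;
import java.util.stream.Collectors;

public class ParallelCollect {
    // Sketch of an ordered parallel collect: every element is mapped, so no
    // Optional wrapper is needed, and encounter order is preserved by the
    // ordered terminal collect.
    static List<Integer> lengths(List<String> source) {
        return source.parallelStream()
            .map(String::length)              // mapping runs concurrently
            .collect(Collectors.toList());    // preserves encounter order
    }
}
```
]]></preformat>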
      <p>parallelSortBy and parallelSortByOrdered
The sortBy operation orders the elements of the source collection by comparing
the result of the expression for each element. This expression can be executed
independently for each element in parallel, since this part is a simple extension
of the parallelCollect operation. Once this collection of comparable properties is
obtained through the mapping function, we can sort the collection in parallel,
a task which we leave to the Java standard library to perform using a
divide-and-conquer approach (java.util.Arrays).</p>
      <p>The mapBy operation derives a property from each element in the collection and
uses it as a key to map the elements. Since this operation returns a multi-map
(that is, a map in which a key can have multiple associated values), there is an
inherent need for synchronization, because the key may or may not be present in
the results collection at any given time, and the collection of associated values
also requires synchronization for writes. Avoiding this synchronization requires
a fundamentally different execution algorithm. Our solution once again achieves
data-parallelism in a similar manner to the parallelCollect operation; however,
each job instead returns a tuple representing a mapping from the derived result
(as obtained by executing the expression on the given element) to the element
itself. We can then merge the individual entries (tuples) afterwards, resolving
duplicate mappings by joining the mapped values into a collection for each
duplicate key, which can be performed in a declarative manner using the Java
Streams and Collectors APIs.</p>
      <p>The aggregate operation is unique in that it takes up to three lambda expressions
as parameters and returns a key-value map. The first derives a key property,
similar to sortBy; the second derives a value property; and the third is an optional
initialisation parameter. Rather than simply performing a key-value mapping,
aggregate also has an accumulator variable (the "total"), which is the value
mapping associated with the key expression. If no value is set, the initialiser
expression is executed, if supplied to the operation. This operation is effectively
a more complex variant of OCL's iterate operation, which behaves like a
traditional for loop. Due to this accumulator variable, the algorithm is
difficult to express in a non-sequential manner. One possibility is to execute each
key expression in parallel, since their "totals" (value aggregations) are
unrelated. Without some pre-requisite knowledge supplied as additional parameters,
annotations or advanced static analysis, a parallel solution may not provide a
significant performance benefit.</p>
      <p>The closure operation recursively evaluates the lambda expression on each
element and its results, and returns them as a flattened collection. This is
theoretically parallelisable using a Fork/Join (divide-and-conquer) approach, and we
leave this as future work.</p>
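      <p>The tuple-merge step of the parallel mapBy described above can be sketched declaratively as follows (an illustrative example of ours; the key-derivation expression is a hypothetical string-length lambda):</p>
      <preformat><![CDATA[
```java
import java.util.AbstractMap.SimpleEntry;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.stream.Collectors;

public class ParallelMapBy {
    // Each parallel job yields a (key, element) tuple; duplicate keys are
    // then merged into a multi-map with Collectors.groupingBy, avoiding
    // synchronized writes to a shared results map.
    static Map<Integer, List<String>> mapByLength(List<String> source) {
        return source.parallelStream()
            .map(s -> new SimpleEntry<>(s.length(), s))   // derive the key
            .collect(Collectors.groupingBy(Entry::getKey,
                     Collectors.mapping(Entry::getValue, Collectors.toList())));
    }
}
```
]]></preformat>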
    </sec>
    <sec id="sec-5">
      <title>Evaluation</title>
      <p>In this section, we evaluate our parallel solutions for both correctness and
performance. Resources for our experiments can be found on GitHub
(github.com/epsilonlabs/parallel-erl).</p>
      <sec id="sec-5-1">
        <title>Correctness</title>
        <p>
          We tested both the sequential and parallel variants of each first-order operation
using EUnit [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], which is a framework for testing model management programs
using familiar unit testing paradigms such as assertions. EUnit allows us to write
test operations in EOL in a similar manner to writing JUnit tests. We create
a test for each first-order operation over a simple data set (e.g. a sequence of
integers from 0 to 10) and ensure that we exercise all outcomes where possible.
For example, when testing the selectOne operation, we write two queries: one
where there is at least one possible result and one with no results. Similarly,
for our tests of operations which return a Boolean, such as forAll and exists, we
ensure that both true and false outcomes are exercised. For short-circuiting
operations such as exists and nMatch, we manually ensured that no unnecessary
evaluations occur, using print statements in the code. We compare the results of
the operations to expected outcomes for both the sequential and parallel
versions. Of course, for the selectOne and rejectOne operations, we ensure that
only one possible solution exists, for consistency with the parallel variants.
        </p>
        <p>In addition, we also test for equivalence between the parallel and sequential
variants using a di erent set of tests with di erent data types (which may also
be null ). We also test that an exception is thrown correctly (with the expected
stack trace and detail message in accordance with the sequential implementation)
by referencing a non-existent property in the lambda expression for each
rstorder operation. Finally, for completeness we also test the visibility of variables
outside the scope of the lambda expression parameters and also invocation of
both user-de ned and built-in operations on the iteration variable. This also
includes nested rst-order operations.</p>
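<p>The essence of the equivalence tests can be illustrated with a plain Java analogue (the actual tests are written in EOL using EUnit): the parallel variant of select must produce exactly the same result as the sequential one, including on data containing null:</p>

```java
import java.util.*;
import java.util.stream.*;

// Illustrative analogue of the sequential-vs-parallel equivalence tests,
// using Java streams as a stand-in for EOL's select operation.
class SelectEquivalenceCheck {
    static List<Integer> sequentialSelect(List<Integer> source) {
        return source.stream()
                     .filter(x -> x != null && x % 2 == 0)
                     .collect(Collectors.toList());
    }

    static List<Integer> parallelSelect(List<Integer> source) {
        // parallelStream preserves encounter order on collect, so the
        // two results are directly comparable with equals().
        return source.parallelStream()
                     .filter(x -> x != null && x % 2 == 0)
                     .collect(Collectors.toList());
    }
}
```

<p>An equivalence test then simply asserts that both methods return equal lists for the same input, exercising null elements and empty results as edge cases.</p>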
        <p>Epsilon also has a large test suite which also makes use of rst-order
operations. After running these tests many times, our experiments revealed no
incorrect results, failed assertions, deadlocks or unexpected exceptions.
5.2</p>
      </sec>
      <sec id="sec-5-2">
        <title>Performance</title>
        <p>
          Given the large number of first-order operations and the computational
similarities between many of them, our benchmarks exercise parallelSelect and
parallelSelectOne, since most of the other predicate-based operations can be
expressed in terms of these two (evaluation of the remaining operations will be
considered in future work). For our test script and model, we
adapted the "Find Couples" transformation in [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] into a query.
The script attempts to find all pairs of people who played together in at least
three movies, using a simple model of the Internet Movie Database.
        </p>
        <p>
          The select benchmark query consists of two select operations, one nested
inside the other. The outer select's source is all Person instances (i.e. actors and
actresses), whilst the inner's source is the set of co-actors for a given person. In
our parallelSelect benchmark, we only parallelise the outer select, relying on the
complexity of the inner computation and the number of Person instances to
provide sufficient benefit from our data-parallel approach. We ran this experiment
using the largest model from [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], which contains over 3.5 million elements.
        </p>
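<p>The shape of this nested query can be sketched in plain Java streams (the actual benchmark is written in EOL; the Person record and its accessors are hypothetical stand-ins for the IMDb model types). Only the outer select is parallelised, while the inner select runs sequentially inside each task:</p>

```java
import java.util.*;
import java.util.stream.*;

// Illustrative sketch of the nested "Find Couples" select benchmark.
class FindCouplesSketch {
    record Person(String name, Set<String> movies) {
        // co-actors: people sharing at least one movie with this person
        Set<Person> coActors(Collection<Person> all) {
            return all.stream()
                      .filter(p -> p != this && !Collections.disjoint(movies, p.movies()))
                      .collect(Collectors.toSet());
        }
    }

    static List<String> couples(Collection<Person> people, int minShared) {
        return people.parallelStream()                         // outer select: parallel
            .flatMap(p -> p.coActors(people).stream()          // inner select: sequential
                .filter(q -> sharedMovies(p, q) >= minShared)
                .filter(q -> p.name().compareTo(q.name()) < 0) // report each pair once
                .map(q -> p.name() + " & " + q.name()))
            .sorted()
            .collect(Collectors.toList());
    }

    static long sharedMovies(Person a, Person b) {
        return a.movies().stream().filter(b.movies()::contains).count();
    }
}
```

<p>Because each outer element carries a non-trivial inner computation, distributing the outer iteration across threads is where the data-parallel benefit comes from.</p>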
        <p>In our parallelSelectOne query, we iterate through all Movie instances, nd
all of the co-actors for all Persons in the Movie, and assert whether there exists a
Person whose name and hash code matches the mid-most Person in the collection
of Person.allInstances(). We should also note that unlike the select benchmark,
we used a cached operation for nding the co-actors in this experiment. As this
is a much more demanding query, we ran this with a smaller model.</p>
        <p>We ran each experiment three times in separate JVM invocations, and took
the mean average time8. We exclude the time taken for parsing the model, which
on average took approximately 40 seconds for the largest model with over 3.53
million elements. For our parallel variants, we used as many threads as physical
cores in the system. The test environment for our experiments was as follows:</p>
        <p>AMD Ryzen Threadripper 1950X (stock clocks, \Creator Mode") 16-core /
32-thread CPU, 32 (4x8) GB DDR4 RAM @ 3003 MHz, Fedora 28 OS (Linux
kernel 4.16), Oracle JDK 10.0.1, JVM options: \-XX:MaxGCPauseMillis=500
-XX:+AggressiveOpts -XX:+UseNUMA -Xmx32g -Xms4g "</p>
        <p>Table 2 shows that although execution time is signi cantly improved, the
speedup is not consistent with the number of cores. This can perhaps be
attributed to the complex architecture of Threadripper, which is e ectively four
quad-core modules (CCXes) split across two dies on a single chip. The lack of
shared cache between each module and CCX switching, which moves threads
between cores (which may be on a di erent die) to balance temperatures and the
non-uniform memory access (NUMA) nature, could all potentially contribute to
this slowdown. To investigate whether the poor speedups were due to a aw in
our parallel algorithm or experiments, we ran the benchmarks on smaller models
(due to memory constraints) using a more conventional system: Intel Core
i53470 quad-core CPU, 12 GB DDR3-1600 MHz RAM, Windows 10 64-bit v1803.
We used the following JVM ags: \-Xms640m -XX:MaxRAMPercentage=95
XX:MaxGCPauseMillis=500 -XX:+AggressiveOpts "</p>
        <p>The results in Table 3 show a marked improvement in e ciency of the
parallel variants compared to the Threadripper system. We see that our selectOne
8 We intend to perform more runs and remove outliers in future experiments
benchmark shows particularly favourable results compared to our select one,
with a more impressive speedup which increases with the number of elements.
Given that both parallelSelect and parallelSelectOne use the same
parallelisation strategy (i.e. the number of jobs submitted to the thread pool is equal to
the number of elements), one possible explanation for this disparity in speed-up
could be that our parallelSelectOne benchmark is more compute-intensive.</p>
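<p>The submission strategy described above can be sketched as follows, again in plain Java rather than Epsilon's actual classes: one job per source element is handed to a fixed-size thread pool sized to the core count:</p>

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.function.Predicate;

// Sketch of the parallelisation strategy: the number of jobs submitted
// to the pool equals the number of elements in the source collection.
class ParallelSelectSketch {
    static <T> List<T> parallelSelect(Collection<T> source, Predicate<T> predicate)
            throws InterruptedException {
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        // Thread-safe accumulator; the ordered variant would instead
        // restore the source collection's encounter order afterwards.
        Queue<T> results = new ConcurrentLinkedQueue<>();
        for (T element : source) {
            pool.execute(() -> {              // one job per element
                if (predicate.test(element)) {
                    results.add(element);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return new ArrayList<>(results);
    }
}
```

<p>With such fine-grained jobs, per-element work that is cheap relative to scheduling overhead yields modest speedups, which is consistent with the more compute-intensive parallelSelectOne benchmark scaling better.</p>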
        <p>In both the 4 and 16-core systems, we see virtually no signi cant di erence in
performance between parallelSelect and parallelSelectOrdered. We also see that
our solution appears to have a \speed-up limit" of 6x, since the performance
di erence between parallelSelect and parallelSelectOne is much smaller with 16
threads than with 4 threads. We intend to investigate this in future work.
6</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusions and future work</title>
      <p>In this paper, we have motivated the case for the parallel execution of
iterationbased operations on collections, arguing that the functional nature of these
operations makes them particularly suitable for data parallelism. Furthermore, we
have seen how the computational similarity between certain operations gives
rise to re-usable patterns for implementing parallel variants of these operations.
More speci cally, we discovered that short-circuiting predicate logic operations
require a di erent implementation strategy than non short-circuiting operations.
Finally, we presented the challenges and solutions in implementing parallel
execution algorithms for a diverse set of rst-order operations in the context of a
feature-rich model-oriented scripting language. Our evaluation showed a signi
cant reduction in execution time, with no perceivable di erence between ordered
and unordered parallelisation strategies.</p>
      <p>
        A notable omission from our parallel operations is closure, which could also
be parallelised. The aggregate and OCL iterate operations are perhaps the most
powerful and complex of the operations, so it would be interesting to provide
parallel implementations for a restricted subset of their use cases. We also
intend to perform a more thorough evaluation in the future and identify potential
performance bottlenecks. Perhaps more fundamentally, we would ideally like to
achieve a combination of both parallelism and laziness, as provided by the Java
Streams API. Future work could therefore investigate how lazy iterator-based
collections, as proposed in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], can be used in conjunction with the parallel
algorithms outlined in this paper, and the associated technical challenges.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Kolovos</surname>
            ,
            <given-names>D.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paige</surname>
            ,
            <given-names>R.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polack</surname>
            ,
            <given-names>F.A.C.</given-names>
          </string-name>
          :
          <article-title>Scalability: The Holy Grail of Model Driven Engineering</article-title>
          .
          <source>In: Proceedings of the First International Workshop on Challenges in Model Driven Software Engineering</source>
          , Toulouse, pp.
          <fpage>10</fpage>
          -
          <lpage>14</lpage>
          (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Kolovos</surname>
            ,
            <given-names>D.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rose</surname>
            ,
            <given-names>L.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Matragkas</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paige</surname>
            ,
            <given-names>R.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guerra</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cuadrado</surname>
            ,
            <given-names>J.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Lara</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rath</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varro</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tisi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cabot</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>A research roadmap towards achieving scalability in model driven engineering</article-title>
          .
          <source>In: Proceedings of the Workshop on Scalability in Model Driven Engineering</source>
          , Budapest. Article No.
          <volume>2</volume>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. Java Streams API, https://docs.oracle.com/javase/10/docs/api/java/util/stream/package-summary.html
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Subramaniam</surname>
          </string-name>
          , V.:
          <article-title>Let's Get Lazy: Explore the Real Power of Streams</article-title>
          , Devoxx United States (
          <year>2017</year>
          ). Available at: https://youtu.be/F73kB4XZQ4I
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Tisi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Douence</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wagelaar</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Lazy evaluation for OCL</article-title>
          .
          <source>In: Proceedings of the 15th International Workshop on OCL and Textual Modeling co-located with 18th International Conference on Model Driven Engineering Languages and Systems</source>
          , Ottawa. pp.
          <fpage>46</fpage>
          -
          <lpage>61</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Vajk</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>David</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Asztalos</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mezei</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Levendovszky</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Runtime model validation with parallel object constraint language</article-title>
          .
          <source>In: Proceedings of the 8th International Workshop on Model-Driven Engineering, Verification and Validation</source>
          , Wellington. Article No.
          <volume>7</volume>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Paige</surname>
            ,
            <given-names>R.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kolovos</surname>
            ,
            <given-names>D.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rose</surname>
            ,
            <given-names>L.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Drivalos</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polack</surname>
            ,
            <given-names>F.A.C.</given-names>
          </string-name>
          :
          <article-title>The Design of a Conceptual Framework and Technical Infrastructure for Model Management Language Engineering</article-title>
          .
          <source>In: Proceedings of the 2009 14th IEEE International Conference on Engineering of Complex Computer Systems</source>
          , Potsdam. pp.
          <fpage>162</fpage>
          -
          <lpage>171</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Kolovos</surname>
            ,
            <given-names>D.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paige</surname>
            ,
            <given-names>R.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polack</surname>
            ,
            <given-names>F.A.C.</given-names>
          </string-name>
          :
          <article-title>The Epsilon Object Language (EOL)</article-title>
          .
          <source>In: Proceedings of the Second European Conference on Model Driven Architecture - Foundations and Applications</source>
          , Bilbao. pp.
          <fpage>128</fpage>
          -
          <lpage>142</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Kolovos</surname>
            ,
            <given-names>D.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paige</surname>
            ,
            <given-names>R.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polack</surname>
            ,
            <given-names>F.A.C.</given-names>
          </string-name>
          :
          <article-title>On the evolution of OCL for capturing structural constraints in modelling languages</article-title>
          .
          <source>Rigorous Methods for Software Construction and Analysis</source>
          , pp.
          <fpage>204</fpage>
          -
          <lpage>218</lpage>
          . Springer Berlin, Heidelberg (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Madani</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kolovos</surname>
            ,
            <given-names>D.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paige</surname>
            ,
            <given-names>R.F.</given-names>
          </string-name>
          :
          <article-title>Parallel Model Validation with Epsilon</article-title>
          .
          <source>In: Proceedings of the 14th European Conference on Modelling Foundations and Applications</source>
          , Toulouse. pp.
          <fpage>115</fpage>
          -
          <lpage>131</lpage>
          . Springer, Cham (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>11. Futures in Java, https://docs.oracle.com/javase/10/docs/api/java/util/concurrent/Future.html</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>García-Domínguez</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kolovos</surname>
            ,
            <given-names>D.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rose</surname>
            ,
            <given-names>L.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paige</surname>
            ,
            <given-names>R.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Medina-Bulo</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>EUnit: a unit testing framework for model management tasks</article-title>
          .
          <source>In: Proceedings of the 14th international conference on Model driven engineering languages and systems, Wellington</source>
          . pp.
          <fpage>395</fpage>
          -
          <lpage>409</lpage>
          . Springer Berlin, Heidelberg (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <article-title>LinTra IMDb case study</article-title>
          , http://atenea.lcc.uma.es/index.php/Main_Page/Resources/LinTra#IMDb
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>