                            Towards a Characterisation of
                           Parallel Functional Applications∗

                  Evgenij Belikov               Hans-Wolfgang Loidl                  Greg Michaelson
                                   School of Mathematical and Computer Sciences
                                       Heriot-Watt University, Edinburgh, UK
                                   {eb120 | H.W.Loidl | G.Michaelson}@hw.ac.uk




                                                        Abstract

                       To devise novel parallelism control mechanisms, further insights into the
                       dynamic behaviour of parallel functional programs and run-time systems
                       (RTS) are needed. We use profiling to characterise eight applications
                       on a multi-core and on a multi-core cluster, focusing on thread granu-
                       larity, memory management and communication. Our results confirm
                       that parallel Haskell implementations cope well with large numbers of
                       potential threads, identify memory management overhead as a key lim-
                       iting factor in the shared-memory RTS, and show that in the distributed
                       RTS the amount of sharing determines the dominant communication overhead.


1    Introduction
Parallelism is currently a key source of gains in application performance: since single-core performance has
stopped increasing due to physical limitations, mainstream architectures feature a growing number of processing
elements (PEs). However, exploiting parallelism remains a major challenge, exacerbated by the diversity of
architectures, which are rapidly evolving towards heterogeneous and hierarchical designs with non-uniform memory access
(NUMA) and interconnect [2, 5]. A key issue for effective parallel programming is the increased complexity of
specifying parallelism management and coordination in addition to a correct and efficient algorithm to solve a
particular problem in a given domain. Hence, a desirable solution should achieve performance portability across
a wide range of architectures without sacrificing programmer productivity [15].
   Although functional languages offer numerous widely recognised benefits, such as a high level of abstraction for
productivity, higher-order functions for composability, and purity, which facilitates parallelisation and allows for
sequential debugging [20, 7, 9, 19, 6], tuning parallelism control remains challenging: the operational cost model
is often non-intuitive, because low-level coordination decisions are taken implicitly by the RTS [10]. For instance,
exploiting all the fine-grained parallelism inherent in functional programs would result in prohibitive thread
management overhead, whilst overly coarse granularity leads to load imbalance, reducing performance and limiting
scalability. Manual tuning appears infeasible given the rapid evolution of architectures, mandating automated solutions.
   Characterising the dynamic behaviour of eight parallel functional programs on a modern 48-core machine and
a cluster of 8-core nodes (using up to 64 cores in total) is a step towards gathering insight into attributes that
can be exploited to improve the effectiveness and efficiency of parallelism management, and is also useful for
comparing run-time systems. The results confirm that the parallel Haskell [16] implementations cope well with
large numbers of potential threads, quantify the impact of sharing (heap fragmentation due to distributed data
structures) on communication overhead in a distributed-memory RTS (GHC-GUM) [18], and identify memory
management overhead as a key limiting factor to scalability for a shared-memory RTS (GHC-SMP) [12].
    ∗This work is partly supported by the Scottish Informatics and Computer Science Alliance (SICSA).




2     Parallel Applications
Most applications1 were adopted from [14, 11] and are grouped by the parallelism pattern they use. We investigate
how program characteristics change across different run-time systems and architectures with varying numbers of PEs.
   Five applications use the Divide-and-Conquer (D&C) pattern, where a problem is recursively split into sub-
problems that are solved and the results combined to form the final result. A threshold value can be used to
restrict the depth of the call tree to a certain level, below which the problem is solved sequentially (a minimal
sketch of this pattern follows the list below).
    • The regular and flat (i.e. not nested) parfib computes the number of function calls for computation of
      the N th Fibonacci number using arbitrary-length integers (input: N = 50, threshold of 23); the naive
      exponential implementation is primarily aimed at assessing thread subsumption capabilities of the RTS.
    • The worpitzky program, from the domain of symbolic computation, checks the Worpitzky identity for two
      given arbitrary-length integers (19 to the exponent of 27, threshold 10).
    • The queens program determines the number of solutions for placing N queens on an N × N board such
      that no two queens attack each other (N = 16); a solution is a list of integers generated by discarding
      unsafe positions, which results in sharing of data structures at the RTS level.
    • The coins program computes the number of ways to pay out a specified amount (5777) from a given set of coins.
    • The minimax application calculates winning positions for a Noughts-and-Crosses game on an N × N board up
      to a specified depth, using alpha-beta search and laziness to prune unpromising sub-trees (N = 4, depth 8).
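
   The skeleton below is a minimal GpH-style sketch of such a thresholded D&C computation, using parfib as the
example. It is illustrative only: the function names, the sequential base case, and the threshold handling are our
own simplification rather than the benchmark source.

    import Control.Parallel (par, pseq)

    -- Counts the calls needed to compute the n-th Fibonacci number (nfib).
    -- Above the threshold t one branch is sparked for potential parallel
    -- evaluation; at or below it the problem is solved sequentially, which
    -- bounds the depth of the parallel call tree and hence thread granularity.
    parfib :: Int -> Int -> Integer
    parfib t n
      | n <= t    = seqfib n                       -- coarse enough: no spark
      | otherwise = r `par` (l `pseq` l + r + 1)   -- advisory parallelism
      where
        l = parfib t (n - 1)
        r = parfib t (n - 2)

    seqfib :: Int -> Integer
    seqfib n
      | n < 2     = 1
      | otherwise = seqfib (n - 1) + seqfib (n - 2) + 1

   With the parameters quoted above this would correspond to a call such as parfib 23 50. The RTS is free to
ignore the spark created by par, which is precisely the thread subsumption behaviour examined in Section 3.2.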
   Three applications are data parallel, i.e. parallelism is exploited by simultaneously applying a function to
the elements of a data structure. Explicit chunking can be used for granularity tuning (see the sketch after this list).
    • The sumeuler program computes the sum of the Euler totient function over a given integer interval and is
      fairly irregular ([0..100000], chunk 500); all of the parallelism is generated at the beginning of the execution.
    • The mandelbrot application computes the fairly irregular Mandelbrot fractal set for a given range and image
      size as well as number of iterations (range [-2.0..2.0], 4096x4096 image size, 3046 iterations).
    • The maze program is a nested data-parallel AI application which searches for a path in a maze (of size 29).
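
   As a hedged illustration (again not the benchmark source), a sumeuler-style computation with explicit chunking
can be expressed with the parListChunk strategy from Control.Parallel.Strategies; the totient definition and the
names used here are deliberately simplified.

    import Control.Parallel.Strategies (parListChunk, rdeepseq, using)

    -- Naive Euler totient, kept simple for illustration.
    totient :: Int -> Int
    totient n = length [ k | k <- [1 .. n], gcd n k == 1 ]

    -- Sum of totients over an interval; the result list is evaluated in
    -- parallel in chunks of the given size, which controls thread granularity.
    sumTotient :: Int -> Int -> Int -> Int
    sumTotient chunkSize lo hi =
      sum (map totient [lo .. hi] `using` parListChunk chunkSize rdeepseq)

   With the parameters above this amounts to sumTotient 500 0 100000; all sparks are created when the chunked
list is traversed, matching the observation that the parallelism is generated at the start of the run.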
   The benchmarks are implemented in Glasgow parallel Haskell [16], a dialect of Haskell [8] that implements a
semi-explicit, annotation-based model of parallelism in which most parallelism management is implicit. Although
similar benchmarks have been used in the past, we run measurements on configurations with a significantly larger
number of PEs, which enables us to better assess scalability, in particular on a distributed-memory platform.
We show that where previously almost linear scaling had been reported on a few cores, scalability often reaches a
plateau well before all PEs are fully utilised on modern server-class multi-cores and multi-core clusters.
   Additionally, we use larger input sizes and obtain more detailed profiles. We have extended the profiling
infrastructure to record per-lightweight-thread granularity information as well as to collect more detailed summary
statistics on protocol messages and sizes (specific to the distributed RTS). Relevant background information on
high-level programming models and RTS-level policies can be found in [3]. Both run-time systems [18, 12]
implement a variant of work-stealing for load balancing, where idle PEs ask randomly chosen PEs for work [4].

3     Application Characterisation
We report relative speedups and profiles from the median run out of three on a dedicated multi-core and on a lightly
loaded cluster of multi-cores, as we are primarily interested in the parallelism behaviour of the applications.
   The 48-core machine (cantor) consists of four AMD Opteron processors, each with two NUMA nodes of six
2.8GHz cores. Every two cores share 2MB of L2 cache, and all six cores on a NUMA node share 6MB of L3
cache and 64GB of RAM (512GB in total). Average memory latency is 16ns, with up to a 3x difference depending
on the NUMA region. The beowulf cluster comprises 8-core Xeon 5504 nodes with two sockets of four 2GHz
cores each, 256KB of L2 cache, 4MB of shared L3 cache and 12GB of RAM, and 8-core Xeon 5450 nodes
with two sockets of four 3GHz cores each, 6MB of shared L2 cache and 16GB of RAM, connected via Gigabit
Ethernet with an average latency of 150ns. We use CentOS 6.5, GHC 6.12.32 , gcc 4.4.7, and PVM 3.4.6.
    1 Application source code can be obtained from http://www.macs.hw.ac.uk/~eb96/atps15benchmarks.tar.gz or via email.
    2 Some experiments using GHC 7.6.3 show similar trends for SMP; GUM has not yet been ported to more recent GHC versions.




3.1   Performance and Scalability
We fix the input size and increase the number of PEs to assess scalability. Run time decreases as the applications
profitably exploit parallelism, resulting in an order-of-magnitude reduction in execution time for five of the
programs. The exceptions are queens, due to excessive memory use; maze, which generates more work as the
number of PEs grows; and GHC-SMP3 runs on higher numbers of PEs, which indicate a scalability issue.




                                          Figure 1: Application Execution Times
   In Figure 1 we observe strong scaling for parfib and coins under GUM and good scaling for sumeuler, albeit
with load-balancing issues at high numbers of PEs. GUM scales up to 64 PEs in most cases, although the benefit
of adding PEs often diminishes with the PE count due to increasing overhead and reduced work per PE. SMP shows
the best performance at relatively low numbers of PEs, whilst on 48 cores the memory management issue discussed in
Section 3.3 leads to a slowdown for 5 out of 8 programs. Moreover, queens, mandelbrot, and minimax exhibit
limited scalability due to excessive communication and heap residency, which hints at improvement potential
at the application level. Increasing the granularity of worpitzky is likely to improve performance, as its median
thread size is currently very small and its thread count very high compared to programs that scale better.

3.2   Granularity
We have extended the run-time profiling capabilities of GUM and SMP to record thread granularity information4 .
Unlike SMP, GUM RTS instances maintain private heaps and thus avoid GC-related synchronisation overhead.
In Figure 2 we present application granularity profiles5 , grouped by RTS and architecture. A log-log scale is
used for comparability, due to orders-of-magnitude differences in thread sizes (x-axis) and thread counts (y-axis).
   For parfib, coins, and worpitzky, we observe an order of magnitude fewer, and correspondingly larger, threads
for GUM than for SMP, which demonstrates the effectiveness of GUM's thread subsumption mechanism and the
aggressiveness of SMP's thread creation for D&C applications. Subsumption enables advisory parallelism, where the
RTS decides whether to inline child threads into the parent or execute them in parallel, similar to lazy futures [13].
The shapes of the GUM profiles on cantor are largely similar to those on beowulf, but differ distinctly from the
SMP profiles, suggesting that RTS-level parallelism management policies have a strong influence on granularity,
especially if architectural features are not explicitly taken into account.
  3 We use SMP and GUM as a shorthand for GHC-SMP (shared-memory RTS) and GHC-GUM (distributed RTS), respectively.
  4 Profiling overhead is negligible as it involves mostly counters and is amortised by other dominant overheads (e.g. GC).
  5 Application names are found in the upper right corner of each histogram, so that three rows in a column represent an application.




Figure 2: Run Time Granularity (GUM vs SMP on cantor using 48 PEs)




   Parallelism is often over-abundant and fine-grained in functional programs, leaving considerable parallel slack-
ness and requiring an effective thread subsumption mechanism. We observe a wide range of actual and potential
degrees of parallelism across applications. For instance, sumeuler has only 200 potential threads, which is insuffi-
cient to keep all PEs busy, whereas coins and worpitzky have four orders of magnitude more potential threads,
most of which are pruned at run time. GUM appears well-suited to D&C applications and is able to subsume
threads to a larger extent than SMP, which creates threads more aggressively. Both systems automatically adapt
the degree of actual parallelism to the number of available PEs.
   By contrast, thread subsumption is ineffective in flat data-parallel applications, where all parallelism is created at
the start of the computation. Hence, to benefit from the subsumption mechanism, data-parallel applications appear
to require nested parallelism (see the sketch below). We also observe optimisation potential for D&C applications:
recognising and inlining smaller threads, and reducing the spread of the granularity distribution and the number
of threads whilst preserving larger threads.
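
   The following sketch is our own illustration of the nesting idea, under the assumption of a Strategies-based
formulation rather than code taken from the benchmarks: an outer spark per block gives idle PEs coarse work to
steal, while surplus inner sparks can be subsumed by their parent threads.

    import Control.Parallel.Strategies (parList, parListChunk, rdeepseq, using)

    -- Local chunking helper (hypothetical; a library equivalent could be used).
    chunksOf :: Int -> [a] -> [[a]]
    chunksOf _ [] = []
    chunksOf n xs = let (h, t) = splitAt n xs in h : chunksOf n t

    -- Nested data parallelism: one spark per block plus chunked sparks within
    -- each block, giving the RTS a hierarchy of work units to convert or prune.
    nestedSum :: Int -> Int -> (Int -> Int) -> [Int] -> Int
    nestedSum blockSize chunkSize f xs =
      sum (map blockSum (chunksOf blockSize xs) `using` parList rdeepseq)
      where
        blockSum block = sum (map f block `using` parListChunk chunkSize rdeepseq)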

3.3   Memory Use and Garbage Collection
Many parallel functional programs are memory-bound, as they perform graph reduction. Figure 3 depicts GC
overhead, a key reason for SMP's scalability issues. The GC% increases consistently across all applications for
SMP and results in severe contention on the first-generation heap. By contrast, GUM starts off with a higher
GC% which then drops or remains constant in most cases. This highlights the benefit of a distributed-memory
design even on shared-memory architectures, as it avoids some of the synchronisation, which pays off particularly
for applications with low communication rates. In addition to GC%, the allocation rate indicates the computational
intensity of each application and can be used to compare the aggressiveness of work allocation across run-time
systems. In Figure 4 we observe initially higher allocation rates for SMP, which then drop faster than for GUM.
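
   These metrics come from our extended RTS profiles. As an aside, on more recent shared-memory GHC releases
comparable figures can be derived from within a program via GHC.Stats, as in the sketch below; the field names
follow the GHC.Stats API of GHC 8.4 and later, the program must run with +RTS -T, and this is not the
instrumentation used for the measurements in this paper.

    import GHC.Stats (RTSStats(..), getRTSStats)

    -- Prints the GC share of CPU time and the allocation rate per second of
    -- mutator time, mirroring the metrics plotted in Figures 3 and 4.
    reportMemoryMetrics :: IO ()
    reportMemoryMetrics = do
      s <- getRTSStats
      let mutSecs = fromIntegral (mutator_cpu_ns s) / 1e9 :: Double
          gcSecs  = fromIntegral (gc_cpu_ns s)      / 1e9 :: Double
          allocMB = fromIntegral (allocated_bytes s) / (1024 * 1024) :: Double
      putStrLn $ "GC%: " ++ show (100 * gcSecs / (mutSecs + gcSecs))
      putStrLn $ "Alloc rate (MB per mutator second): " ++ show (allocMB / mutSecs)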




                                     Figure 3: Garbage Collection Overhead




                                            Figure 4: Allocation Rates




   Application working sets are represented by heap residency. We observe roughly constant or decreasing
residency for GUM on both distributed- and shared-memory architectures (except for minimax), whilst for SMP
residency grows in most cases, as contention causes some heap-allocated objects to be retained for longer.

3.4   Sharing and Communication
GUM uses virtual shared memory, so each RTS instance maintains a Global Address (GA) table of stable
inter-processor pointers which are used as roots for GC. Fragmentation of the shared heap can lead to decreased
performance since excessive sharing results in higher GA residency and reduced locality, which leads to additional
communication overhead. Thus GA residency can be used as an indicator of the degree of virtual shared heap
fragmentation. Based on this metric, our application set can be partitioned into two classes: most of the
applications, shown in Figure 5, exhibit a moderate GA residency of at most 600 per PE; in contrast,
worpitzky reaches a value of 2500 for large numbers of PEs, and, worse still, mandelbrot (not shown) reaches
a GA residency of 8000 and queens (not shown) of over 250000. This points to a high degree of sharing in
these programs due to poor data distribution and locality, which incurs a lot of communication, becomes a
bottleneck for parallel performance, and hints at potential for application-level optimisation.




                       Figure 5: Global Address Table Residency (Heap Fragmentation)
   As shown in Figure 6, for parfib, coins, maze, and to a lesser extent minimax and sumeuler, we observe a
modest linear increase in communication rate, with work request messages accounting for less than 40% of the
total. The median amount of graph sent usually increases slightly, but only for queens is it excessive, at over
14MB of graph sent per mutation second on 48 cores, as the communication rate skyrockets: 840k messages on
48 cores, with frequent and very long fetches (when a light-weight thread blocks waiting for another) and only
15% of the messages being work requests.




                             Figure 6: Communication Rate Comparison for GUM




   We are currently investigating ways to eliminate these overheads. The next highest communication rate is for
worpitzky, with over 100k messages sent on 48 cores, almost 50% of which are work requests, owing to its very
fine thread granularity. sumeuler follows, illustrating another issue: a lack of inherent parallelism for the given
chunk size leads to load imbalance at higher numbers of PEs, demonstrated by over 95% of sent messages being
requests for work, which coincides with decreasing memory residency and a low allocation rate.
   For most applications the number of packets sent increases linearly and reflects the size of the shared graph,
whilst packet size is mostly very small and constant (between 5 and 50 bytes), except for queens (4k) and
mandelbrot (ca. 9k). Due to space limitations we present no graphs for these metrics. We find that, in general,
packets are smallest for integer-based programs, and data-parallel programs often have larger packets than D&C
programs. Communication rate appears to correlate with heap fragmentation (GA residency), and the percentage
of work requests among the total number of messages appears to indicate the degree of load imbalance.
   Careful parallelisation is required to avoid pitfalls that result in excessive overhead and limit scalability. We
have found that semi-explicit parallel functional programs have the potential to scale, provided that, on the one
hand, there is enough work and, on the other, granularity and sharing are adequately controlled.

4     Conclusions
We have characterised a set of small and medium-sized parallel functional applications run on a multi-core and on
a cluster of multi-cores in terms of scalability, granularity, GC overhead, global address table residency, allocation
rate and communication rate, among other metrics. We have found that profiling reveals diverse bottlenecks and
helps gain insight into dynamic application behaviour. In particular, the results indicate that:
    • Subsumption works well across architectures and D&C applications, as the RTS is able to handle a large
      number of potential light-weight threads and prune superfluous parallelism by subsuming computations into
      the parent thread. For data-parallel programs nesting appears to be necessary to benefit from subsumption.
    • Compared to SMP, GUM is less aggressive in instantiating parallelism, i.e. generates fewer threads of larger
      granularity, adapting to the number of PEs and to system latency (the higher the latency, the lazier the
      instantiation). Additionally, granularity varies across run-time systems but seems relatively similar for the
      same RTS across different architectures.
    • High GA residency is a good indicator of heap fragmentation due to sharing, which in turn causes a high
      degree of communication, limiting the scalability of the application.
    • Communication rate and GA residency vary considerably across applications and have a high, direct impact
      on parallel performance.
    • System-level information (e.g. granularity profiles, GA residency representing heap fragmentation and the
      fraction of work requests in relation to the total number of messages) appears promising for improving
      dynamic policy control decisions at RTS level.
    • Increased memory residency and GC-percentage in a shared-memory design point to a scalability issue due
      to contention on the first generation heap, in contrast to a distributed-memory design, confirming the results
      from [1]. This suggests higher scalability potential of the RTS design that uses private heaps.
   The insights from this characterisation inform the design of a dynamic adaptation mechanism, based on
monitoring a set of relevant parameters and dynamically tuning related policies. For instance, we are currently
implementing a co-location mechanism for potential threads that leverages ancestry information to encourage
stealing work from nearby sources to improve locality and reduce fragmentation of the shared heap for
applications with multiple sources of parallelism. Additionally, architectural information on communication
latency and computational power of different PEs could be used to further improve co-location decisions [17].
   Ultimately, we envision a growing set of representative parallel functional benchmark applications that could
be used (similar to mainstream languages) to compare different novel parallelism management policies and mech-
anisms, aiming at high performance portability across heterogeneous architectures whilst retaining productivity.

Acknowledgements
We are grateful to Natalia Chechina, Julian Godesa, Rob Stewart, and Prabhat Totoo for inspiring discussions
and to the anonymous reviewers who helped improve the presentation and the discussion of the results.




References
 [1] M. Aljabri, H.-W. Loidl, and P. Trinder. Distributed vs. shared heap, parallel Haskell implementations on
     shared memory machines. In Proc. of Symp. on Trends in Functional Programming, University of Utrecht,
     The Netherlands, 2014. To appear.
 [2] K. Asanovic, R. Bodik, J. Demmel, T. Keaveny, K. Keutzer, J. Kubiatowicz, N. Morgan, D. Patterson,
     K. Sen, J. Wawrzynek, D. Wessel, and K. Yelick. A view of the parallel computing landscape. CACM,
     52:56–67, October 2009.
 [3] E. Belikov, P. Deligiannis, P. Totoo, M. Aljabri, and H.-W. Loidl. A survey of high-level parallel program-
     ming models. Technical Report HW-MACS-TR-0103, Heriot-Watt University, December 2013.
 [4] R. Blumofe and C. Leiserson. Scheduling multithreaded computations by work stealing. Journal of the
     ACM, 46(5):720–748, September 1999.
 [5] A. Brodtkorb, C. Dyken, T. Hagen, J. Hjelmervik, and O. Storaasli. State-of-the-art in heterogeneous
     computing. Scientific Programming, 18(1):1–33, May 2010.
 [6] K. Hammond. Why parallel functional programming matters: Panel statement. In A. Romanovsky and
     T. Vardanega, editors, Ada-Europe 2011, volume 6652 of LNCS, pages 201–205, 2011.
 [7] P. Hudak. Conception, evolution, and application of functional programming languages. ACM Computing
     Surveys, 21(3):359–411, September 1989.
 [8] P. Hudak, J. Hughes, S. Peyton Jones, and P. Wadler. A history of Haskell: being lazy with class. In Proc.
     of the 3rd ACM SIGPLAN History of Programming Languages Conference, pages 1–55, June 2007.
 [9] J. Hughes. Why functional programming matters. Research Directions in Functional Programming, pages
     17–42, 1990.
[10] H.-W. Loidl, P. Trinder, K. Hammond, S. Junaidu, R. Morgan, and S. Peyton Jones. Engineering Parallel
     Symbolic Programs in GpH. Concurrency: Practice and Experience, 11:701–752, 1999.
[11] S. Marlow, P. Maier, H.-W. Loidl, M. Aswad, and P. Trinder. Seq no more: better Strategies for parallel
     Haskell. In Proc. of the 3rd Symposium on Haskell, Haskell ’10, pages 91–102. ACM, 2010.
[12] S. Marlow, S. Peyton Jones, and S. Singh. Runtime support for multicore Haskell. In ACM SIGPLAN
     Notices, volume 44, pages 65–78, 2009.
[13] E. Mohr, D.A. Kranz, and R.H. Halstead Jr. Lazy task creation: A technique for increasing the granularity
     of parallel programs. IEEE Transactions on Parallel and Distributed Systems, 2(3):264–280, July 1991.
[14] W. Partain. The nofib benchmark suite of Haskell programs. In Functional Programming, Glasgow 1992,
     pages 195–202. Springer, 1993.
[15] H. Sutter and J. Larus. Software and the concurrency revolution. ACM Queue, 3(7):54–62, 2005.
[16] P. Trinder, E. Barry Jr., M. Davis, K. Hammond, S. Junaidu, U. Klusik, H.-W. Loidl, and S. Peyton Jones.
     GpH: An architecture-independent functional language. IEEE Transactions on Software Engineering, 1998.
[17] P. Trinder, M. Cole, K. Hammond, H.-W. Loidl, and G. Michaelson. Resource analyses for parallel and
     distributed coordination. Concurrency and Computation: Practice and Experience, 25(3):309–348, 2013.
[18] P. Trinder, K. Hammond, J. Mattson Jr., A. Partridge, and S. Peyton Jones. GUM: a portable parallel
     implementation of Haskell. In Proc. of PLDI’96 Conf., 1996.
[19] P. Trinder, K. Hammond, H-W. Loidl, and S. Peyton Jones. Algorithm + Strategy = Parallelism. Journal
     of Functional Programming, 8(1):23–60, January 1998.
[20] S. Vegdahl. A survey of proposed architectures for the execution of functional languages. IEEE Transactions
     on Computers, 33(12):1050–1071, December 1984.



