II. PERFORMANCE EVALUATION

Demo: Using TAU for Performance Evaluation of Scientific Software

III. DEMONSTRATION

0 1 0 Allen D. Malony Department of Computer and Information Science University of Oregon Eugene , OR 97403 , USA 1 Sameer Shende Performance Research Laboratory University of Oregon Eugene , OR 97403 , USA

-This paper presents the demonstration of the TAU Performance System for performance evaluation of Scientific Software written in C++, C, and Fortran. Index Terms-TAU, instrumentation, performance analysis, PDT, measurement.

Given the diversity of performance problems, evaluation methods, and types of events and metrics, the instrumentation and measurement mechanisms needed to support performance observation must be flexible, to give maximum opportunity for configuring performance experiments, and portable, to allow consistent cross-platform performance problem solving. In general, flexibility in empirical performance evaluation implies freedom in experiment design, and choices in selection and control of experiment mechanisms. Using tools that otherwise limit the type and structure of performance methods will restrict evaluation scope. Portability, on the other hand, looks for common abstractions in performance methods and how these can be supported by reusable and consistent techniques across different computing environments (software and hardware).

The TAU parallel performance system is the product of over two decades of development to create a robust, flexible, portable, and integrated framework and toolset for performance instrumentation, measurement, analysis, and visualization of large-scale parallel computer systems and applications. The architecture of TAU is shown in Fig. 1.

The demo will highlight the instrumentation of MPI programs on the NSF XSEDE system, Stampede, at TACC. It will demonstrate how TAU may be used to insert instrumentation in the source code using the C, C++, and Fortran parsers from the Program Database Toolkit (PDT) with TAU compiler scripts that may be used in place of compiler scripts provided by MPI. It will show to execute programs on the Intel® Xeon PhiTM systems and generate profiles that will be loaded in TAU’s ParaProf 3D browser as shown in Fig. 2. These profiles may be stored in TAUdb, a performance database and analyzed using TAU’s PerfExplorer tool [ 2 ] for cross-platform scalability studies and performance data mining. TAU uses PAPI [ 3 ] internally to access low-level hardware performance counters such as floating point instructions, level 1 and 2 data cache misses, and vector instructions executed in the code. Using these counters, TAU can show the extent of loop vectorization as shown in Fig. 3. TAU’s ParaProf browser can show the time spent in each routine on all threads in its main window as shown in Fig. 4. It can also show the communication matrix as shown in Fig. 5 and a thread statistics window as shown in Fig. 6. TAU can support automatic instrumentation for code written in C, C++, Fortran, Java, and Python. It can be easily integrated in the build system of application frameworks and be enabled at compile-time using specially designed compiler scripts. TAU also supports instrumentation during program execution using preloading of TAU’s Dynamic Shared Object (DSO) in the address space of the executing application. Using tau_exec, a user may evaluate the performance of an un-instrumented application. This includes memory, I/O, communication performance as well as event-based sampling to show the contribution at the statement level. TAU supports a variety of runtime systems used in HPC including OpenSHMEM, MPI, MPC, OpenMP, pthread, OpenCoArrays, CUDA, OpenCL, and OpenACC. The demo will show the use of TAU for performance engineering of software used in HPC.

IV. CONCLUSION

The TAU performance system addresses performance technology problems at three levels: instrumentation, measurement, and analysis. The TAU framework supports the configuration and integration of these layers to target specific performance problem solving needs. However, effective exploration of performance will necessarily require prudent selection from the range of alternative methods TAU provides to assemble meaningful performance experiments that sheds light on the relevant performance properties. To this end, the TAU performance system offers support to the performance analysis in various ways, including powerful selective and multi-level instrumentation, profile and trace measurement modalities, interactive performance analysis analysis, and performance data management.

ACKNOWLEDGMENT

This work was supported by the National Science Foundation (NSF) grant number ACI-1450471. This work used the Extreme Science and Discovery Environment (XSEDE) that is supported by the NSF grant number ACI-1053575 and used allocation TG-ASC090010.

[1]

Shende and

A. D.

Malony , “ The TAU Parallel Performance System,” IJPCA , Vol 20 , No. 2 , pp. 287 - 311 , 2006 . http://tau.uoregon.edu.

[2]

Huck and

A. D.

Malony , “ PerfExplorer: A Performance Data Mining Framework for Large-Scale Parallel Computing,” Proc. SC' 2005 , ACM, IEEE, 2005 .

[3]

Tennessee , Performance Application Programming Interface , http://icl.cs.utk.edu/papi, 2016 .

[4]

Geimer ,

Shende ,

Wesarg , and

Wylie , “ Practical Hybrid Parallel Application Performance Engineering,” Tutorial , SC'15 ,Austin,TX, http://sc15.supercomputing.org/schedule/event_detailevid=tut117.html, Nov. 16 , 2015 .