<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Demo: Using TAU for Performance Evaluation of Scientific Software</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Allen D. Malony</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sameer Shende</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer and Information Science, University of Oregon</institution>
          ,
          <addr-line>Eugene, OR 97403</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Performance Research Laboratory, University of Oregon</institution>
          ,
          <addr-line>Eugene, OR 97403</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents a demonstration of the TAU Performance System for performance evaluation of scientific software written in C++, C, and Fortran.</p>
        <p>Index Terms: TAU, instrumentation, performance analysis, PDT, measurement.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>II. PERFORMANCE EVALUATION</title>
      <p>Given the diversity of performance problems, evaluation
methods, and types of events and metrics, the instrumentation
and measurement mechanisms needed to support performance
observation must be flexible, to give maximum opportunity
for configuring performance experiments, and portable, to
allow consistent cross-platform performance problem solving.
In general, flexibility in empirical performance evaluation
implies freedom in experiment design and choice in the selection
and control of experiment mechanisms. Tools that
limit the type and structure of performance methods
restrict the scope of evaluation. Portability, on the other hand,
looks for common abstractions in performance methods and
how these can be supported by reusable and consistent
techniques across different computing environments (software
and hardware).</p>
      <p>The TAU parallel performance system is the product of over
two decades of development to create a robust, flexible,
portable, and integrated framework and toolset for
performance instrumentation, measurement, analysis, and
visualization of large-scale parallel computer systems and
applications. The architecture of TAU is shown in Fig. 1.</p>
    </sec>
    <sec id="sec-1b">
      <title>III. DEMONSTRATION</title>
      <p>
        The demo will highlight the instrumentation of MPI
programs on the NSF XSEDE system Stampede at TACC. It will
demonstrate how TAU inserts instrumentation into
source code using the C, C++, and Fortran parsers from the
Program Database Toolkit (PDT), driven by TAU compiler scripts
that are used in place of the compiler scripts provided by MPI.
It will show how to execute programs on Intel® Xeon Phi™
systems and generate profiles that are loaded into TAU’s
ParaProf 3D browser, as shown in Fig. 2. These profiles may be
stored in TAUdb, a performance database, and analyzed using
TAU’s PerfExplorer tool [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] for cross-platform scalability
studies and performance data mining. TAU uses PAPI [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
internally to access low-level hardware performance counters that count events such as
floating-point instructions, level 1 and level 2 data cache misses, and
vector instructions executed in the code. Using these counters,
TAU can show the extent of loop vectorization, as shown in
Fig. 3. TAU’s ParaProf browser can show the time spent in
each routine on all threads in its main window, as shown in
Fig. 4. It can also show the communication matrix, as shown in
Fig. 5, and a thread statistics window, as shown in Fig. 6. TAU
supports automatic instrumentation of code written in C,
C++, Fortran, Java, and Python. It can be integrated into
the build system of application frameworks and enabled at
compile time using its compiler scripts. TAU
also supports instrumentation during program execution by
preloading TAU’s Dynamic Shared Object (DSO) into the
address space of the executing application. Using tau_exec, a
user may evaluate the performance of an uninstrumented
application. This includes memory, I/O, and communication
performance, as well as event-based sampling to attribute
performance to individual source statements. TAU supports a variety of runtime
systems used in HPC, including OpenSHMEM, MPI, MPC,
OpenMP, pthread, OpenCoArrays, CUDA, OpenCL, and
OpenACC. The demo will show the use of TAU for
performance engineering of software used in HPC.
      </p>
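      <p>The communication matrix mentioned above can be illustrated with a minimal Python sketch. It assumes that each message is available as a (sender, receiver, bytes) record; this record layout, the function names, and the sample data are hypothetical and do not reflect TAU's profile format or ParaProf's implementation. Entry [i][j] of the matrix accumulates the bytes sent from rank i to rank j.</p>
      <preformat>
```python
def communication_matrix(messages, num_ranks):
    """Return a num_ranks x num_ranks matrix of bytes sent.

    messages: iterable of (sender, receiver, nbytes) tuples. This layout is
    an assumption made for illustration, not TAU's data format.
    """
    matrix = [[0] * num_ranks for _ in range(num_ranks)]
    for sender, receiver, nbytes in messages:
        matrix[sender][receiver] += nbytes
    return matrix


def heaviest_pair(matrix):
    """Return (sender, receiver, nbytes) for the largest matrix entry."""
    return max(
        ((s, r, nbytes) for s, row in enumerate(matrix) for r, nbytes in enumerate(row)),
        key=lambda entry: entry[2],
    )


# Illustrative data: a 4-rank ring exchange plus extra traffic from rank 0.
messages = [
    (0, 1, 1024), (1, 2, 1024), (2, 3, 1024), (3, 0, 1024),
    (0, 2, 8192), (0, 1, 512),
]
matrix = communication_matrix(messages, 4)
for row in matrix:
    print(row)
print("heaviest pair:", heaviest_pair(matrix))
```
      </preformat>
      <p>In a display such as the one in Fig. 5, a matrix of this kind is typically rendered as a heat map, so that an unusually heavy sender and receiver pair stands out from the regular communication pattern.</p>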
    </sec>
    <sec id="sec-2">
      <title>IV. CONCLUSION</title>
      <p>The TAU performance system addresses performance
technology problems at three levels: instrumentation,
measurement, and analysis. The TAU framework supports the
configuration and integration of these layers to target specific
performance problem-solving needs. However, effective
exploration of performance requires prudent
selection from the range of alternative methods TAU provides
to assemble meaningful performance experiments that shed
light on the relevant performance properties. To this end, the
TAU performance system supports performance
analysis in various ways, including powerful selective and
multi-level instrumentation, profile and trace measurement
modalities, interactive performance analysis, and
performance data management.</p>
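      <p>The difference between the profile and trace measurement modalities can be sketched in a few lines of Python. The sketch assumes that a trace is a time-ordered list of (timestamp, kind, routine) events with properly nested enter and exit records; the event layout, function name, and sample data are hypothetical and are not TAU's trace format. A profile is then the aggregation of that trace into per-routine call counts, inclusive time, and exclusive time.</p>
      <preformat>
```python
from collections import defaultdict


def profile_from_trace(events):
    """Aggregate (timestamp, kind, routine) events into per-routine totals.

    kind is "enter" or "exit"; events must be time-ordered and properly
    nested. Recursive calls add inclusive time once per activation.
    Returns {routine: {"calls": n, "inclusive": t, "exclusive": t}}.
    """
    profile = defaultdict(lambda: {"calls": 0, "inclusive": 0.0, "exclusive": 0.0})
    stack = []  # each frame: [routine, enter_time, time_spent_in_children]
    for timestamp, kind, routine in events:
        if kind == "enter":
            stack.append([routine, timestamp, 0.0])
        elif kind == "exit":
            name, start, child_time = stack.pop()
            if name != routine:
                raise ValueError(f"unbalanced trace: exit {routine} inside {name}")
            inclusive = timestamp - start
            entry = profile[name]
            entry["calls"] += 1
            entry["inclusive"] += inclusive
            entry["exclusive"] += inclusive - child_time
            if stack:
                stack[-1][2] += inclusive  # charge this call to its parent
        else:
            raise ValueError(f"unknown event kind: {kind}")
    if stack:
        raise ValueError("trace ended with routines still active")
    return dict(profile)


# Illustrative trace (times in seconds); routine names are made up.
trace = [
    (0.0, "enter", "main"),
    (1.0, "enter", "solve"),
    (2.0, "enter", "MPI_Send"),
    (2.5, "exit", "MPI_Send"),
    (4.0, "exit", "solve"),
    (5.0, "enter", "solve"),
    (6.0, "exit", "solve"),
    (7.0, "exit", "main"),
]
profile = profile_from_trace(trace)
for routine, stats in profile.items():
    print(routine, stats)
```
      </preformat>
      <p>The trace preserves the order and timing of every event, at the cost of storage that grows with execution length, whereas the profile keeps only the aggregated totals.</p>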
    </sec>
    <sec id="sec-3">
      <title>ACKNOWLEDGMENT</title>
      <p>This work was supported by National Science
Foundation (NSF) grant number ACI-1450471. This work used the
Extreme Science and Engineering Discovery Environment (XSEDE), which is
supported by NSF grant number ACI-1053575, and used
allocation TG-ASC090010.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Shende</surname>
          </string-name>
          and
          <string-name>
            <given-names>A. D.</given-names>
            <surname>Malony</surname>
          </string-name>
          , “
          <article-title>The TAU Parallel Performance System</article-title>
          ,”
          <source>IJHPCA</source>
          , Vol.
          <volume>20</volume>
          , No.
          <issue>2</issue>
          , pp.
          <fpage>287</fpage>
          -
          <lpage>311</lpage>
          ,
          <year>2006</year>
          . http://tau.uoregon.edu.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K.</given-names>
            <surname>Huck</surname>
          </string-name>
          and
          <string-name>
            <given-names>A. D.</given-names>
            <surname>Malony</surname>
          </string-name>
          , “
          <article-title>PerfExplorer: A Performance Data Mining Framework for Large-Scale Parallel Computing</article-title>
          ,”
          <source>Proc. SC'2005</source>
          , ACM, IEEE,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <collab>U. Tennessee</collab>
          ,
          <article-title>Performance Application Programming Interface</article-title>
          , http://icl.cs.utk.edu/papi,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Geimer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shende</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wesarg</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Wylie</surname>
          </string-name>
          , “
          <article-title>Practical Hybrid Parallel Application Performance Engineering</article-title>
          ,” Tutorial,
          <source>SC'15</source>
          , Austin, TX, http://sc15.supercomputing.org/schedule/event_detailevid=tut117.html, Nov. 16,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>