<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Performance Characterization of Scientific Workflows for the Optimal Use of Burst Buffers</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Christopher S. Daley, Devarshi Ghoshal, Glenn K. Lockwood, Sudip Dosanjh, Lavanya Ramakrishnan, Nicholas J. Wright. Lawrence Berkeley National Laboratory 1 Cyclotron Rd Berkeley</institution>
          ,
          <addr-line>CA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <fpage>69</fpage>
      <lpage>73</lpage>
      <abstract>
        <p>Scientific discoveries are increasingly dependent upon the analysis of large volumes of data from observations and simulations of complex phenomena. Scientists compose the complex analyses as workflows and execute them on large-scale HPC systems. The workflow structures are in contrast with monolithic single simulations that have often been the primary use case on HPC systems. Simultaneously, new storage paradigms such as Burst Buffers are also becoming available on HPC platforms. In order to maximize the performance of data analysis workflows today, it is critical to determine the characteristics of the workflows. Obtaining a deeper understanding of the workflows helps us identify opportunities to leverage the capabilities of the Burst Buffer. In this paper, we analyze the performance characteristics of the Burst Buffer and two representative scientific workflows. We measure the performance of these workflows using the Burst Buffer, allowing us to make recommendations for future optimal usage of workflows using the Burst Buffer.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The science drivers for high-performance computing
(HPC) are broadening with the proliferation of high-resolution
observational instruments and emergence of completely new
data-intensive scientific domains. Scientific workflows that
chain the processing and data are becoming critical to
manage these on HPC systems. Thus, while providers of
supercomputing resources must continue to support the
extreme bandwidth requirements of traditional
supercomputing applications, centers must now also deploy resources
that are capable of supporting the requirements of these
emerging data-intensive workflows. In sharp contrast to
the highly coherent, sequential, large-transaction reads and
writes that are characteristic of traditional HPC
checkpoint-restart workloads [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], data-intensive workflows have been
shown to often utilize non-sequential, metadata-intensive,
and small-transaction reads and writes [
        <xref ref-type="bibr" rid="ref13 ref23">13, 23</xref>
        ]. Parallel file
systems in today's supercomputers have been optimized for
more traditional HPC workloads [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. The rapid growth in
I/O demands coming from data-intensive workflows is
placing new performance and optimization requirements
on future HPC I/O subsystems [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. It is therefore
essential to develop methods to quantitatively characterize the
I/O needs of data-intensive workflows to ensure that
correct resources can be deployed with the correct balance of
performance characteristics.
      </p>
      <sec id="sec-1-1">
        <title>Copyright held by the author(s).</title>
        <p>
          The emergence of data-intensive workflows has coincided
with the emergence of flash devices being integrated into
the HPC I/O subsystem as a "Burst Buffer", a
performance-optimized storage tier that resides between compute nodes
and the high-capacity parallel file system (PFS). The Burst
Buffer was originally conceived for the massive bandwidth
requirements of checkpoint-restart workloads for extreme-scale
simulation [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. The tier buffers bursts of I/O traffic to
enable the PFS to service a lower bandwidth load spread over a
longer time period. However, the flash-based storage media
underlying Burst Buffers are also substantially faster than
spinning disk for the non-sequential and small-transaction
I/O workloads of data-intensive workflows. This motivates
using the media for use cases beyond buffering of I/O
requests, such as providing a temporary scratch space,
coupling workflow stages, and in-transit processing [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
        <p>
          Today's commercially available Burst Buffer solutions [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]
expose their flash through the POSIX API, which enables
workflows to easily leverage the technology's capabilities.
We need to understand and optimize the use of Burst Buffers
to serve the needs of data-intensive workflows. Thus, it is
essential to understand workflows' specific I/O requirements
in the context of both flash-based storage media and the I/O
stack through which applications utilize the Burst Buffer.
        </p>
        <p>In this paper, we characterize two of the production data
analytics workflows used at the National Energy Research
Scientific Computing Center (NERSC) at Lawrence
Berkeley National Laboratory, and we present an analysis of their
performance on the production Burst Buffer resource
deployed as a part of NERSC's Cori system. The paper is
organized as follows. Section 2 presents the background for
the paper: related work and the details of the NERSC Burst
Buffer architecture. Section 3 details our approach to
scalable I/O characterization for both workflows and Section 4
presents a detailed analysis of the I/O requirements of these
workflows. We discuss efficient use of Burst Buffers in
Section 5 and provide conclusions in Section 6.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. BACKGROUND</title>
      <p>In this section we describe related work and the NERSC
Burst Buffer architecture.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Related Work</title>
      <p>
        Scientific Workflows. Data-intensive scientific workflows
have been shown to process large amounts of data with
varied I/O characteristics [
        <xref ref-type="bibr" rid="ref16 ref21 ref7 ref9">16, 21, 9, 7</xref>
        ]. Deelman et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]
highlight several challenges in data management for
data-intensive scientific workflows. Several strategies have been
proposed to optimize data management for scientific
workflows in HPC environments [
        <xref ref-type="bibr" rid="ref20 ref28 ref8">28, 20, 8</xref>
        ]. However, Burst
Buffers add another layer in the storage hierarchy, adding
to the data management challenges for scientific workflows.
Hence, it is important to characterize scientific workflows to
optimally use Burst Buffers based on their I/O
characteristics. In this paper, we evaluate and characterize multiple
workflows with different I/O profiles to understand the
optimal use of Burst Buffers.
      </p>
      <p>
        Burst Buffers. Several uses of Burst Buffers have been
shown in order to mitigate the I/O bottlenecks of
data-intensive workloads [
        <xref ref-type="bibr" rid="ref19 ref22 ref25 ref6">19, 6, 22, 25</xref>
        ]. Most studies surrounding
the design and use of Burst Buffers have so far focused on the
I/O characteristics of individual applications [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] or small
components within workflows [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. However, research into
optimizing scientific workflows with diverse I/O and storage
requirements for Burst Buffers is still in its infancy, and a
limited body of work presently exists [
        <xref ref-type="bibr" rid="ref13 ref5">13, 5</xref>
        ]. Beyond
single applications and workflows, researchers are investigating
I/O-aware scheduling on systems with a Burst Buffer.
Herbein et al. [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] demonstrate that system utilization can be
improved by using application drain bandwidth between the
Burst Buffer and PFS as a scheduling constraint. Thapaliya
et al. [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] show how different Burst Buffer allocation policies
and the order of servicing I/O requests affect total
application throughput on a system with a shared Burst Buffer.
      </p>
      <p>
        DataWarp. DataWarp is Cray's implementation of a Burst
Buffer, and few guidelines exist for how to use it optimally
for scientific workflows. Bhimji et al. show performance
results for a collection of applications selected as part of
NERSC's Early User Program [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The results focus on
application I/O bandwidth on DataWarp and the PFS. The
NERSC website provides a list of known issues and
overall guidelines for achieving high performance, but does not
show when, why and how to use DataWarp for specific
workflow use cases [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Our work has analyzed two data analytics
workflows and identified I/O signatures along with the
specific workflow requirements to advise how to use DataWarp.
      </p>
    </sec>
    <sec id="sec-4">
      <title>2.2 The NERSC Burst Buffer Architecture</title>
      <p>
        NERSC's Cori system features a Burst Buffer based on
Cray DataWarp [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. This architecture is built upon discrete
Burst Buffer nodes (BB nodes), each containing two Intel
P3608 SSDs that deliver 6.4 TiB of usable capacity and 5.7
GiB/s of bandwidth. Currently, Cori has a total of 144 BB
nodes, over 900 TiB of usable capacity, and over 800 GiB/s
of peak performance.
      </p>
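      <p>The system-wide figures are consistent with the per-node numbers. The following is a minimal Python sketch of that check, assuming capacity and bandwidth simply add across BB nodes (an idealization); the constant and function names are illustrative and not part of any DataWarp interface.</p>
      <preformat preformat-type="code">
```python
# Per-node figures quoted in the text for Cori's Burst Buffer nodes.
BB_NODES = 144
CAPACITY_PER_NODE_TIB = 6.4
BANDWIDTH_PER_NODE_GIB_S = 5.7


def aggregate(nodes, per_node_value):
    """Sum a per-node quantity over all nodes (assumes ideal linear scaling)."""
    return nodes * per_node_value


total_capacity_tib = aggregate(BB_NODES, CAPACITY_PER_NODE_TIB)
total_bandwidth_gib_s = aggregate(BB_NODES, BANDWIDTH_PER_NODE_GIB_S)

print("capacity: %.1f TiB" % total_capacity_tib)
print("bandwidth: %.1f GiB/s" % total_bandwidth_gib_s)
```
      </preformat>
      <p>Under this assumption the totals come to roughly 921.6 TiB and 820.8 GiB/s, in line with the "over 900 TiB" and "over 800 GiB/s" figures.</p>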
      <p>Cray's DataWarp middleware aggregates the SSDs on each
of the BB nodes and provides user jobs with dynamically
provisioned private parallel file systems. Users can request
a certain capacity of Burst Buffer in 200 GiB increments
(which we call fragments) when submitting jobs. Each
fragment is allocated on a different BB node to allow the
aggregate performance of the BB allocation to scale with the
requested capacity. DataWarp also designates one of the BB
nodes as the metadata server for the allocation. This
allocation is mounted on the job nodes when the job is launched,
and it is typically torn down upon job completion. However,
users may also request a persistent mode allocation, which
allows a BB allocation to persist across multiple jobs.</p>
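      <p>To illustrate the 200 GiB granularity, the sketch below converts a requested capacity into a fragment count. It assumes that a request is rounded up to a whole number of fragments, which the text does not state explicitly; <monospace>fragments_for_capacity</monospace> is a hypothetical helper, not a DataWarp call.</p>
      <preformat preformat-type="code">
```python
import math

FRAGMENT_GIB = 200  # allocation granularity on Cori, as stated in the text


def fragments_for_capacity(requested_gib):
    """Return (fragment_count, granted_gib) for a requested capacity in GiB.

    Assumption: requests are rounded up to whole fragments.
    """
    if not requested_gib > 0:
        raise ValueError("requested capacity must be positive")
    count = math.ceil(requested_gib / FRAGMENT_GIB)
    return count, count * FRAGMENT_GIB


for request in (2.6, 200, 500):
    print(request, fragments_for_capacity(request))
```
      </preformat>
      <p>Since each fragment is placed on a different BB node, the fragment count is also the number of BB nodes that can serve the allocation's data.</p>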
      <p>DataWarp also offers private mode reservations where each
compute node gets its own metadata server within the Burst
Buffer allocation and, by extension, its own private
namespace. This enables higher aggregate metadata performance
since each compute node's metadata is serviced by a unique
BB node.</p>
    </sec>
    <sec id="sec-5">
      <title>3. METHODOLOGY</title>
      <p>In this section, we detail our performance analysis
methodology and the workloads used for our analyses.</p>
    </sec>
    <sec id="sec-6">
      <title>3.1 Workflows</title>
      <p>The two workflows studied in the paper were selected
because they stress the I/O subsystem in very different ways:
CAMP is limited by metadata performance and SWarp is
limited by data transfer performance. When discussing the
workflows, we use the term "workflow pipeline" to refer to a
single unit of the larger workflow.</p>
      <sec id="sec-6-1">
        <title>3.1.1 CAMP</title>
        <p>
          The CAMP (Community Access MODIS Pipeline)
workflow processes Earth's land and atmospheric data obtained
from MODIS satellite data [
          <xref ref-type="bibr" rid="ref16 ref27 ref3">3, 27, 16</xref>
          ]. It transforms the
MODIS data from a swath space and time coordinate system
(latitude and longitude) into a sinusoidal tiling system (tiles
using sinusoidal projection). The MODIS data for CAMP
consists of small geometa files in plain text format and swath
products as Hierarchical Data Format (HDF) files. Each
geometa file is only a few KB and is used by all the swath
products from a particular satellite. Each swath product
has several files per day, each of which is approximately 1.1 MB in
size and contains the product data in the swath space and time
coordinate system.
        </p>
        <p>The CAMP workflow consists of two processing steps:
a) builddb, which assembles and maps swaths to their
corresponding sinusoidal tiles, and b) reproject, which converts
the MODIS products from a swath coordinate system to a
sinusoidal tiling system. Figure 1 shows the high-level
representation of the CAMP workflow that includes the data
staging operations to and from the Burst Buffer. The
workflow pipeline in this paper transforms one MODIS product's
swath coordinates for one day into one specific tile. CAMP
is written in Python and generates an intermediate SQLite
database to provide the mapping for the reproject stage. We
use Conda, which uses the Anaconda Python distribution,
to install CAMP on DataWarp.</p>
      </sec>
      <sec id="sec-6-2">
        <title>3.1.2 SWarp</title>
        <p>The SWarp workflow combines overlapping raw images of
the night sky into high quality reference images. It is used in
the Dark Energy Camera Legacy Survey (DECaLS) to
produce high quality images of 14,000 deg² of northern
hemisphere sky. In this survey, each SWarp workflow pipeline
produces an image for a 0.25 deg² "brick" of sky. The
average input to each workflow pipeline is 16 × 32 MiB input
images and 16 × 16 MiB input weight maps.</p>
        <p>The SWarp workflow pipeline consists of a data
resampling stage and a data combination stage. The data
resampling stage interpolates the raw images and creates
resampled images which can be trivially stacked. The data
combination stage reads back the resampled images and then
performs a reduction over the pixels to produce a single
stacked image. The raw, resampled and stacked images are
all in Flexible Image Transport System (FITS) file format.
The DAG when using a Burst Buffer is similar to CAMP:
input images and weight map files are staged in prior to the
data resampling stage and the combined image is staged out
after the data combination stage. SWarp is written in C and
multithreaded with POSIX threads.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>3.2 Workload Configuration</title>
      <p>
        The workflow pipelines are run in their production
configuration on Cori and all I/O is directed to DataWarp mount
points. The DataWarp reservation is configured to use a
shared namespace and one fragment of capacity. A job
reservation is used for SWarp and a persistent reservation is used
for CAMP (in order to retain the CAMP Python software
environment between jobs). The Integrated Performance
Monitoring (IPM) profiling tool [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is used to collect run
time, memory usage and time in different I/O calls for each
workflow stage. The workflow pipelines are then replicated
on 1 to 64 compute nodes (with 1 workflow pipeline per
compute node) and I/O is directed to a fixed storage reservation
of 1 DataWarp fragment. This allows us to study how run
time is affected by the saturation of the storage resource.
      </p>
    </sec>
    <sec id="sec-8">
      <title>4. RESULTS</title>
      <p>The high-level characteristics of the stages in a single
workflow pipeline are shown in Table 1. The workflow stages
are found to spend 10-30% of time in I/O. This is the best
achievable I/O time and can only get worse as more
workflow pipelines contend for the same storage resource.</p>
      <p>Figures 2 and 3 show how I/O time changes with
concurrency for the most time-consuming stage of each workflow.
I/O time is divided into time spent in metadata operations
and data operations. The experiments are repeated three
times at each node count and the plots show the mean time
per workflow pipeline stage. The error bars simply show the
range of mean times over the three experiments.</p>
      <p>Table 1: Characteristics of the workflow stages SWarp-resample, SWarp-coadd, CAMP-builddb and CAMP-reproject: compute threads, I/O threads, wall time (s), I/O time (s), I/O time (%), peak memory (MiB) and total file size (MiB).</p>
      <p>
        Figure 2 shows the scaling of SWarp-resample. The
results show that wall clock time remains relatively constant
until about 16 workflow pipelines and that I/O time is
dominated by data rather than metadata operations. Figure 3
shows that the scaling of CAMP-builddb is limited by metadata
performance. One source of these metadata operations is
the startup of Python applications, which is known to
be a scalability issue in Python HPC applications [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. It
happens because Python searches for files providing a
package in every directory in the Python path. In spite of this,
the dominant source of metadata load in CAMP-builddb is
the transactions to the SQLite database.
      </p>
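      <p>The cost of the Python path search can be made concrete with a small experiment. The sketch below is a simplified model and not CPython's actual import machinery (which, among other things, caches directory listings): it probes each directory on a search path for a package directory or a file with any known module suffix, and counts the file-existence checks made before the module is found. The names <monospace>count_lookups</monospace>, <monospace>demo</monospace> and <monospace>example_mod</monospace> are hypothetical.</p>
      <preformat preformat-type="code">
```python
import importlib.machinery
import os
import tempfile


def count_lookups(module_name, search_path):
    """Count file-existence checks a naive path search performs for one module.

    Simplified model: in every directory, probe a package directory and one
    file per known module suffix, stopping at the first hit.
    """
    suffixes = importlib.machinery.all_suffixes()
    lookups = 0
    for directory in search_path:
        candidates = [os.path.join(directory, module_name, "__init__.py")]
        candidates += [os.path.join(directory, module_name + s) for s in suffixes]
        for candidate in candidates:
            lookups += 1
            if os.path.exists(candidate):
                return lookups, candidate
    return lookups, None


def demo(num_dirs=8):
    """Place a hypothetical module in the last of num_dirs directories."""
    with tempfile.TemporaryDirectory() as root:
        path = []
        for i in range(num_dirs):
            d = os.path.join(root, "dir%d" % i)
            os.mkdir(d)
            path.append(d)
        with open(os.path.join(path[-1], "example_mod.py"), "w") as handle:
            handle.write("VALUE = 1\n")
        return count_lookups("example_mod", path)


if __name__ == "__main__":
    for n in (8, 16):
        print(n, "directories:", demo(n)[0], "lookups")
```
      </preformat>
      <p>In this model the number of lookups grows with the length of the search path, and every lookup is a metadata operation that a shared metadata server must service for each concurrently starting pipeline.</p>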
    </sec>
    <sec id="sec-9">
      <title>5. DISCUSSION</title>
      <p>In this section, a) we discuss the key characteristics of the
workflows analyzed and use the information to highlight the
effective use of Burst Buffers and, b) we apply this
knowledge to explain how to achieve the optimum performance
with the DataWarp implementation of a Burst Buffer.</p>
    </sec>
    <sec id="sec-10">
      <title>5.1 Efficient use of Burst Buffers</title>
      <p>The key findings from our experimental analyses are:</p>
      <p>1. A single workflow pipeline does not provide
the I/O parallelism needed to make efficient
use of Burst Buffers. The data analytics workflows
studied in this paper consist of single-process
applications which perform I/O with a single thread of
execution. This is poorly matched with the need to have
multiple I/O streams to obtain the peak performance
from Burst Buffer flash storage. Unfortunately,
single I/O stream workflow pipelines are a common
feature of high throughput data analytics workflows. We
show that better utilization of Burst Buffer resources
is possible by executing multiple concurrent workflow
pipelines against the same unit of Burst Buffer storage.
Our results indicate that a single unit of DataWarp
storage on Cori can sustain the I/O requests from
approximately 16 concurrent workflow pipelines before
there is any slowdown.</p>
      <p>2. A scaled out workflow pipeline is often limited
by metadata performance. Our analysis has found
significant metadata costs originating from database
transactions, Python initialization and opening many
small files. The aggregated metadata operations from
multiple workflow pipelines can easily saturate a
single metadata server, as shown in the CAMP-builddb
workflow stage.</p>
      <p>3. It is valuable to explicitly control the data in
the Burst Buffer tier. The workflows read input
data sets and produce a number of intermediate files
which can be discarded once there are final results,
e.g. the resampled images in SWarp and the SQLite
database in CAMP. Therefore, we do not expect
automatic file movement between the Burst Buffer and
the PFS to benefit these two data analytics workflows.
This is because the one-time cost of staging the input
data at access time may not be hidden by significant
data reuse. Automatic file movement would also
transfer the intermediate files to the PFS unnecessarily.</p>
      <p>4. It is valuable to leave data in the Burst Buffer
tier for longer than a single batch job. We have
found that input files and software environments are
reused across workflow pipelines.</p>
      <p>The input data for data analytics workflows are
generally Write Once Read Many times (WORM).
In the SWarp workflow a single input image often
contributes to multiple regions of the sky.
Therefore it is wasteful to re-stage the same input file
for every workflow pipeline that uses it.</p>
      <sec id="sec-10-1">
        <title>The software environment is reused in every single workflow pipeline.</title>
        <p>In the CAMP workflow the
Python environment is responsible for some of the
I/O. The role of "support I/O" (e.g. Python
packages) is rarely mentioned in the context of Burst
Buffers. It is useful to stage the software
environment once to avoid the overhead and wear of
repeatedly staging the software environment.</p>
        <p>Long-term data residency is not a good fit for today's
Burst Buffers because they do not provide data
redundancy. This imposes a data management burden upon
the developer.</p>
      </sec>
    </sec>
    <sec id="sec-11">
      <title>5.2 Efficient use of DataWarp</title>
      <p>DataWarp storage reservations on Cori consist of
multiple storage fragments of size 200 GiB. The scaling studies
show that both SWarp and CAMP are limited by DataWarp
performance rather than capacity. SWarp and CAMP have
an aggregate capacity requirement of up to 2.6 GiB and 150
MiB per workflow pipeline, respectively (Table 1). However,
the performance saturates before fully utilizing the 200 GiB
of capacity at approximately 16 workflow pipelines per
DataWarp fragment. This means that excess capacity must be
reserved to sustain performance in a scaled out workflow.
Metadata bottlenecks, such as seen in CAMP-builddb, can
be addressed by combining the reservation of excess capacity
with the private mode feature of DataWarp.</p>
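      <p>The sizing rule implied by these observations can be sketched as follows. This is a back-of-the-envelope estimate, assuming that the saturation point of about 16 pipelines per fragment measured here carries over to other pipeline counts; <monospace>size_reservation</monospace> and its field names are hypothetical, and the threshold is approximate.</p>
      <preformat preformat-type="code">
```python
import math

FRAGMENT_GIB = 200           # fragment size on Cori (from the text)
PIPELINES_PER_FRAGMENT = 16  # approximate saturation point observed in this study


def size_reservation(pipelines, gib_per_pipeline):
    """Estimate a reservation sized for performance as well as capacity."""
    needed_gib = pipelines * gib_per_pipeline
    fragments = max(
        math.ceil(pipelines / PIPELINES_PER_FRAGMENT),  # performance-driven
        math.ceil(needed_gib / FRAGMENT_GIB),           # capacity-driven
    )
    reserved_gib = fragments * FRAGMENT_GIB
    return {
        "fragments": fragments,
        "reserved_gib": reserved_gib,
        "needed_gib": needed_gib,
        "excess_gib": reserved_gib - needed_gib,
    }


# 64 concurrent SWarp pipelines at up to 2.6 GiB each.
print(size_reservation(64, 2.6))
```
      </preformat>
      <p>For 64 SWarp pipelines this suggests 4 fragments (800 GiB) even though only about 166 GiB of data is held, which illustrates the excess capacity that must be reserved to sustain performance.</p>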
    </sec>
    <sec id="sec-12">
      <title>6. CONCLUSION</title>
      <p>In this paper we analyzed the performance of two
scientific workflows running on the Cori supercomputer with
the DataWarp Burst Buffer. We show that a single
workflow pipeline does not have the parallelism to utilize the
capabilities of the flash storage hardware. We also show
that the workflows have different I/O performance
characteristics: SWarp is bound by data transfer performance and
CAMP (specifically CAMP-builddb) is bound by metadata
performance as the workflows are scaled out. The results are
used to give general advice about using Burst Buffers more
efficiently and to provide specific advice for DataWarp.</p>
    </sec>
    <sec id="sec-13">
      <title>Acknowledgments</title>
      <p>This work was supported by Laboratory Directed Research
and Development (LDRD) funding from Berkeley Lab,
provided by the Director, Office of Science and Office of Science,
Office of Advanced Scientific Computing Research (ASCR)
of the U.S. Department of Energy under Contract No.
DE-AC02-05CH11231. This research used resources of the
National Energy Research Scientific Computing Center, a DOE
Office of Science User Facility supported by the Office of
Science of the U.S. Department of Energy under Contract No.
DE-AC02-05CH11231. The authors would also like to thank
Rollin Thomas for help with installing the CAMP Python
software environment on DataWarp.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <article-title>[1] Burst Buffer</article-title>
          . NERSC website: http://www.nersc.gov/users/computational-systems/cori/burst-buffer/; accessed 31 August
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] IPM. https://github.com/nerscadmin/IPM; accessed 13 July
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <article-title>NASA MODIS Website</article-title>
          . http://modis.gsfc.nasa.gov/.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <source>Trinity / NERSC-8 Use Case Scenarios</source>
          . Technical Report SAND 2013-2941 P, Los Alamos National Laboratory, Sandia National Laboratories, NERSC, Apr.
          <year>2013</year>
          . https://www.nersc.gov/assets/Trinity--NERSC-8-RFP/Documents/trinity-NERSC8-use-case-v1.2a.pdf; accessed 4 October 2016.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <article-title>APEX Workflows</article-title>
          .
          <source>Technical report</source>
          , Los Alamos National Laboratory, NERSC, and Sandia National Laboratories, Los Alamos, NM,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bent</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Grider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kettering</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Manzanares</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>McClelland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Torres</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Torrez</surname>
          </string-name>
          .
          <article-title>Storage challenges at Los Alamos National Lab</article-title>
          .
          <source>In IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST)</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          , April
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>G. B.</given-names>
            <surname>Berriman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Deelman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Good</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Jacob</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Katz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kesselman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Laity</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. A.</given-names>
            <surname>Prince</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Singh</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.-H.</given-names>
            <surname>Su</surname>
          </string-name>
          .
          <article-title>Montage: a grid-enabled engine for delivering custom science-grade mosaics on demand</article-title>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bharathi</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Chervenak</surname>
          </string-name>
          .
          <article-title>Scheduling data-intensive workflows on storage constrained resources</article-title>
          .
          <source>In Proceedings of the 4th Workshop on Workflows in Support of Large-Scale Science, WORKS '09</source>
          , pages
          <fpage>3:1</fpage>
          -
          <lpage>3:10</lpage>
          , New York, NY, USA,
          <year>2009</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bharathi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chervenak</surname>
          </string-name>
          , E. Deelman, G. Mehta,
          <string-name>
            <given-names>M. H.</given-names>
            <surname>Su</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Vahi</surname>
          </string-name>
          .
          <article-title>Characterization of scientific workflows</article-title>
          .
          <source>In 2008 Third Workshop on Workflows in Support of Large-Scale Science</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          , Nov.
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>W.</given-names>
            <surname>Bhimji</surname>
          </string-name>
          et al.
          <article-title>Accelerating Science with the NERSC Burst Buffer Early User Program</article-title>
          . In Cray User Group CUG, May
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Byna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Uselton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Knaak</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y. H.</given-names>
            <surname>He</surname>
          </string-name>
          .
          <article-title>Lessons Learned from a Hero I/O Run on Hopper</article-title>
          . In 2013 Cray User Group Meeting, Napa, CA,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>P.</given-names>
            <surname>Carns</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vilayannur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kunkel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Ludwig</surname>
          </string-name>
          .
          <article-title>Small-file access in parallel file systems</article-title>
          .
          <source>In 2009 IEEE International Symposium on Parallel &amp; Distributed Processing</source>
          , pages
          <fpage>1</fpage>
          –
          <lpage>11</lpage>
          . IEEE, May
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Daley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ramakrishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dosanjh</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N. J.</given-names>
            <surname>Wright</surname>
          </string-name>
          .
          <article-title>Analyses of Scientific Workflows for Effective Use of Future Architectures</article-title>
          .
          <source>In Proceedings of the 6th International Workshop on Big Data Analytics: Challenges, and Opportunities (BDAC-15)</source>
          , Austin, TX,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>E.</given-names>
            <surname>Deelman</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Chervenak</surname>
          </string-name>
          .
          <article-title>Data management challenges of data-intensive scientific workflows</article-title>
          .
          <source>In Cluster Computing and the Grid, 2008. CCGRID '08. 8th IEEE International Symposium on</source>
          , pages
          <fpage>687</fpage>
          –
          <lpage>692</lpage>
          , May
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Enkovaara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Romero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shende</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Mortensen</surname>
          </string-name>
          .
          <article-title>GPAW - massively parallel electronic structure calculations with Python-based software</article-title>
          .
          <source>Procedia Computer Science</source>
          ,
          <volume>4</volume>
          :
          <fpage>17</fpage>
          –
          <lpage>25</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>V.</given-names>
            <surname>Hendrix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ramakrishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ryu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>van Ingen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. R.</given-names>
            <surname>Jackson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          .
          <article-title>CAMP: Community Access MODIS Pipeline</article-title>
          .
          <source>Future Generation Computer Systems</source>
          ,
          <volume>36</volume>
          :
          <fpage>418</fpage>
          –
          <lpage>429</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>D.</given-names>
            <surname>Henseler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Landsteiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Petesch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wright</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Wright</surname>
          </string-name>
          .
          <article-title>Architecture and Design of Cray DataWarp</article-title>
          . In Cray User Group CUG, May
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>S.</given-names>
            <surname>Herbein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. H.</given-names>
            <surname>Ahn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lipari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. R.</given-names>
            <surname>Scogland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stearman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Grondona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Garlick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Springmeyer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Taufer</surname>
          </string-name>
          .
          <article-title>Scalable I/O-Aware Job Scheduling for Burst Buffer Enabled HPC Clusters</article-title>
          .
          <source>In Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, HPDC '16</source>
          , pages
          <fpage>69</fpage>
          –
          <lpage>80</lpage>
          , New York, NY, USA,
          <year>2016</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>N.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cope</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Carns</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Carothers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Grider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Crume</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Maltzahn</surname>
          </string-name>
          .
          <article-title>On the role of burst buffers in leadership-class storage systems</article-title>
          .
          <source>In IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST)</source>
          , pages
          <fpage>1</fpage>
          –
          <lpage>11</lpage>
          , Apr.
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>H. M.</given-names>
            <surname>Monti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Butt</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Vazhkudai</surname>
          </string-name>
          .
          <article-title>On timely staging of HPC job input data</article-title>
          .
          <source>IEEE Transactions on Parallel and Distributed Systems</source>
          ,
          <volume>24</volume>
          (
          <issue>9</issue>
          ):
          <fpage>1841</fpage>
          –
          <lpage>1851</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ramakrishnan</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Plale</surname>
          </string-name>
          .
          <article-title>A multi-dimensional classification model for scientific workflow characteristics</article-title>
          .
          <source>In Proceedings of the 1st International Workshop on Workflow Approaches to New Data-centric Science, Wands '10</source>
          , pages
          <fpage>4:1</fpage>
          –
          <lpage>4:12</lpage>
          , New York, NY, USA,
          <year>2010</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>K.</given-names>
            <surname>Sato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Mohror</surname>
          </string-name>
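          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moody</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gamblin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>de Supinski</surname>
          </string-name>
          ,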
          <string-name>
            <given-names>N.</given-names>
            <surname>Maruyama</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Matsuoka</surname>
          </string-name>
          .
          <article-title>A user-level InfiniBand-based file system and checkpoint strategy for burst buffers</article-title>
          .
          <source>In Cluster, Cloud and Grid Computing (CCGrid), 2014 14th IEEE/ACM International Symposium on</source>
          , pages
          <fpage>21</fpage>
          –
          <lpage>30</lpage>
          , May
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>K. A.</given-names>
            <surname>Standish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. M.</given-names>
            <surname>Carland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. K.</given-names>
            <surname>Lockwood</surname>
          </string-name>
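          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Pfeiffer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tatineni</surname>
          </string-name>
          ,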
          <string-name>
            <given-names>C. C.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lamberth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cherkas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Brodmerkel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Jaeger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Rajagopal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Curran</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N. J.</given-names>
            <surname>Schork</surname>
          </string-name>
          .
          <article-title>Group-based variant calling leveraging next-generation supercomputing for large-scale whole-genome sequencing studies</article-title>
          .
          <source>BMC Bioinformatics</source>
          ,
          <volume>16</volume>
          (
          <issue>1</issue>
          ):304, Dec
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>S.</given-names>
            <surname>Thapaliya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bangalore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lofstead</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Mohror</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Moody</surname>
          </string-name>
          .
          <article-title>Managing I/O Interference in a Shared Burst Buffer System</article-title>
          .
          <source>In 2016 45th International Conference on Parallel Processing (ICPP)</source>
          , pages
          <fpage>416</fpage>
          –
          <lpage>425</lpage>
          , Aug.
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>B.</given-names>
            <surname>Van Essen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pearce</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ames</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Gokhale</surname>
          </string-name>
          .
          <article-title>On the Role of NVRAM in Data-intensive Architectures: An Evaluation</article-title>
          .
          <source>In 2012 IEEE 26th International Parallel and Distributed Processing Symposium</source>
          , pages
          <fpage>703</fpage>
          –
          <lpage>714</lpage>
          . IEEE, May
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Oral</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pritchard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Vasko</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Yu</surname>
          </string-name>
          .
          <article-title>Development of a burst buffer system for data-intensive applications</article-title>
          . CoRR, abs/1505.01765,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>R. E.</given-names>
            <surname>Wolfe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Roy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Vermote</surname>
          </string-name>
          .
          <article-title>MODIS land data storage, gridding, and compositing methodology: Level 2 grid</article-title>
          .
          <source>IEEE Transactions on Geoscience and Remote Sensing</source>
          ,
          <volume>36</volume>
          (
          <issue>4</issue>
          ):
          <fpage>1324</fpage>
          –
          <lpage>1338</lpage>
          , Jul
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Vazhkudai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. G.</given-names>
            <surname>Pike</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Cobb</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Mueller</surname>
          </string-name>
          .
          <article-title>Optimizing center performance through coordinated data staging, scheduling and recovery</article-title>
          .
          <source>In Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, SC '07</source>
          , pages
          <fpage>55:1</fpage>
          –
          <lpage>55:11</lpage>
          , New York, NY, USA,
          <year>2007</year>
          . ACM.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>