<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Life Sciences &amp; The Dutch Grid: An Analysis from a Grid Supporter's perspective</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lammerts</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>e-Science Support Group, SARA Computing and Networking Services</institution>
          ,
          <addr-line>Science Park 121, 1098 XG Amsterdam</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2009</year>
      </pub-date>
      <abstract>
        <p>Motivation: Over the past few years extensive effort has been undertaken to enable the Life Science community to implement the Grid in their research. Although much progress has been made we observe that Grid is still far from being utilized to its fullest extent. We do, however, believe that the concepts of Grid have great potential for this specific target group. Therefore it is important to summarize the problems on a conceptual level in order to provide a sense of direction to the ongoing international development efforts.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>INTRODUCTION TO THE DUTCH LIFE SCIENCE GRID</title>
      <p>The part of the Dutch Grid accessible to Dutch Life Science
Virtual Organizations (VOs) is divided over eleven sites at
eight locations (see Illustration 1: Sites in The
Netherlands accessible for Dutch Life Science VO's).</p>
      <p>All sites with the 'LSG' prefix have been set up in the context of
the Life Science Grid<sup>1</sup> project. These sites, with the exception of
LSG-EMC, have 16 (32-bit) cores and 1.5 terabytes of storage.
The storage will be upgraded to 20 terabytes in the near future.
LSG-EMC has 32 cores and 20 terabytes of storage, and five more
sites with the same capacity will be set up in the near future. In
addition, all LSG sites will be upgraded to 64-bit architectures. All
sites run the gLite 3.1.35 middleware with DPM 3.1.28-0 as the
Storage Resource Manager (SRM) protocol implementation for the
Storage Elements (SEs).</p>
      <p>The Life Science Grid project is an initiative of SARA
Computing and Networking Services<sup>2</sup>, commissioned by the
Netherlands Computer Facilities Foundation<sup>3</sup> (NCF) and the
Netherlands Bioinformatics Center<sup>4</sup> (NBIC) in their capacity as
founding partners in the BigGrid project<sup>5</sup>. SARA, one of the core
partners in BigGrid, is responsible for the placement and
administration of the clusters, as well as for providing support to their
users.</p>
      <p>1 https://grid.sara.nl/wiki/index.php/Life_Science_Grid
2 http://www.sara.nl
3 http://www.nwo.nl/ncf
4 http://www.nbic.nl</p>
    </sec>
    <sec id="sec-4">
      <title>DIFFICULTIES EXPERIENCED BY THE DUTCH LIFE SCIENTISTS ON THE GRID</title>
      <p>The difficulties that Life Scientists experience when using the
Grid are communicated back to us through incident-based notices.
Most of these can be filed under one of three categories:
• Organizational: how does XXX work, where can I find
information about it, and through what channel can I request
support?
• Technical: my job has successfully generated output, but
when I try to retrieve it from the SE it does not seem to exist.
What is wrong?
• Naive: I run hundreds of jobs which stage output files of 1 GB
each in the output sandbox. Why do almost all of my jobs fail?</p>
      <sec id="sec-4-1">
        <title>Naivety-based problems</title>
        <p>Over the past two decades the modality of performing Life Science
has been changing<sup>6</sup>. Research has started to shift away from
being hypothesis-driven and focuses more on being driven by
bulk data generation and interpretation. This shift is where High
Performance Computing (HPC) and High Throughput Computing (HTC)
come into the picture.</p>
        <p>However, the shift is not complete. Even today, many Life
Scientists struggle with the new perspective, and fundamental
concepts of data-centric research have not yet settled in the field.
As more funding becomes available to stimulate the
use of HPC and HTC within the Life Sciences, the traditional and
new perspectives are intertwining, though not without the
difficulties that cause the Life Scientist's struggle.</p>
        <p>A good example is Bioinformatics. Over the past decades the
fields of Biology, Informatics and Mathematics (Statistics
and Probability Theory in particular) have begun to overlap (see
Illustration 2: The overlap between Biology, Informatics and
Mathematics), and as a result the field of Bioinformatics was
created. However, we notice a lack of 'pure' Bioinformaticians:
not many know the details of an algorithm, how to implement it in
software and how to interpret the results. Most scientists work in
what we call BioICT: they know how to use software and
interpret results, but have little knowledge of the ins and outs of
the algorithm, let alone of its implementation. Hence, the
requirements for data-centric Biological research are not
completely met. Although BioICT is a logical first step for the
scientist, it still requires close cooperation with BioStatistics and
Statistical Informatics.</p>
        <p>While the Life Scientist tries to find his way in Computer
Science, he runs into the problem of dealing with and
managing large-scale distributed systems. For example, when
running software like BLAST<sup>7</sup> (Basic Local Alignment Search
Tool) on a desktop computer, the researcher specifies his input
data, runs the program, and when it finishes his output data is
available directly. When utilizing the Grid this workflow is less
intuitive (from his perspective, at least). “Your job is not being
submitted because you have not delegated your proxy”, “You
should stage your output data on an SE because the WMS cannot
handle that much output data in an output sandbox”, “Your output
datasets are too many and too small, and cannot be handled by the
SE”, and other similar hints and tips keep recurring when
supporting the scientist.</p>
        <p>5 http://www.biggrid.nl/
6 Gusfield, D. (2002) http://webcast.ucdavis.edu/Engineering/2008/ECS124_02/ECS124_4-1-02_L-1.asx
7 http://www.ncbi.nlm.nih.gov/BLAST/</p>
        <p>Since these problems stem from the transition that is taking
place in the field, they are hard to solve. But that is not the whole
answer. In our role as Computer Scientists and System
Engineers we have a responsibility to ease the transition: if not
from us, from whom can a scientist learn to handle the computing
facilities?</p>
        <p>In practice these problems manifest in the following areas:
• Using the command line interface
• Job planning and management</p>
        <p>Other problems originate from the technology rather than the
scientist. The Grid, although continuously maturing, is still not a
production infrastructure. Furthermore, it is based on academic
software, which is not always as stable as we would like.</p>
        <p>One of the general issues is the trade-off between the
scale and the stability of the Grid. As the Grid grows (in
systems as well as people), so does the surface
on which errors can occur. We notice that errors resulting
from scale are not merely theoretical: often when
a component is updated or upgraded, another
component, an interface or a client breaks in one way or another.</p>
        <p>Illustration 3: The concept of Storage Elements</p>
        <p>There is another dimension that requires attention. The major
early adopter (if not initiator) of Grid technology was the High
Energy Physics (HEP) community. As such it contributed much to
the current state of the Grid; it defined its needs ever more clearly
over the years and influenced development of several systems that
are now common ground on the Grid.</p>
        <p>A common example is data management. The concept of Grid
data management in The Netherlands is based on the SRM
protocol. An implementation of this protocol is at the heart of an
SE (in The Netherlands this implementation is usually either
dCache<sup>8</sup> or Disk Pool Manager<sup>9</sup> (DPM)). For an overview of a
common Data Management solution see Illustration 3: The concept
of Storage Elements.</p>
        <p>This set-up is a fairly general one for distributed data storage and
scales fairly well. However, the SRM protocol has three downsides
from the perspective of the Life Scientist:</p>
        <p>First, each transfer has major overhead because of the scope of the
protocol. It is known to be very inefficient when transferring
many small files, which is a common use-case within the Life
Sciences, in contrast to HEP.</p>
        <p>Second, the protocol is not common. It has no general-purpose
clients available and the storage is not locally mountable.
Furthermore, it has only one official client interface, which
is based on the command line.</p>
        <p>Third, the implementations of the protocol are academic software.
Experience shows that they are not as stable as we would like
them to be. This is a problem for small communities: while
the bigger communities like HEP are able to increase
redundancy by replicating their data over multiple countries,
the Life Science VOs are much more dependent on the local
sites.</p>
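        <p>The per-transfer overhead described under the first downside can be made concrete with a simple cost model. The sketch below assumes that every transfer pays a fixed protocol overhead on top of the time needed to move the bytes; the function name and the overhead and bandwidth figures are illustrative assumptions, not measurements of any SRM implementation.</p>
        <preformat>
```python
def transfer_time(n_files, total_bytes, overhead_s, bandwidth_bps):
    """Estimated wall time in seconds for moving total_bytes split over n_files.

    Model (an assumption for illustration): each file pays a fixed
    per-transfer overhead, and the payload moves at a constant bandwidth.
    """
    return n_files * overhead_s + total_bytes / bandwidth_bps


# Hypothetical figures: 2 s of protocol overhead per transfer, 100 MB/s link.
OVERHEAD_S = 2.0
BANDWIDTH_BPS = 100e6
TOTAL_BYTES = 10e9  # 10 GB of data in both scenarios

many_small = transfer_time(10_000, TOTAL_BYTES, OVERHEAD_S, BANDWIDTH_BPS)
few_large = transfer_time(10, TOTAL_BYTES, OVERHEAD_S, BANDWIDTH_BPS)

print(f"10,000 files of 1 MB: {many_small:.0f} s")
print(f"10 files of 1 GB:     {few_large:.0f} s")
```
        </preformat>
        <p>Under these assumed figures the same 10 GB takes 20100 s when split into 10,000 small files and 120 s when split into ten large ones, so the fixed overhead, not the data volume, dominates the small-file case.</p>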
        <p>Without going into further detail by comparing the computing
needs of HEP and the Life Sciences, it is important to realize that
some of the current solutions do not fit the normal Life Science
use-case. Awareness of this issue is still young and was raised
mainly because the method of technology push, as applied so far,
was not successful. The signals coming from the field strongly
indicate the need of the scientist for more influence on the state of
the Grid, or in other words, the need for a technology pull.</p>
        <p>8 http://www.dcache.org/
9 https://twiki.cern.ch/twiki/bin/view/LCG/DataManagementDocumentation</p>
        <p>Most technical problems we see are related to one of the
following:
• Authentication
• Job management and failure rates
• Data management</p>
      </sec>
      <sec id="sec-4-2">
        <title>Organization-based problems</title>
        <p>After the two previous, external factors, we need to turn around
and look at ourselves. One of the issues causing difficulties lies
in the horizontal way that support and documentation are
provided: no distinction is made between the different types
of scientists or VOs. A single type and version of documentation
and education is provided for all. However, scientists can differ
from each other in many ways, from expertise and frame of reference
to substance of research and computing needs. Although the
situation in other countries might differ from that in The Netherlands, it
remains crucial that support is provided in a way that fits the
scientist and his community.</p>
        <p>Issues with an organizational background are related to one of the
following:
• Starting with the Grid
• The role of the community or VO</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>MAPPING THE LIFE SCIENCES ONTO THE GRID AND VICE VERSA</title>
      <p>What steps can we take to map the Life Scientist onto the Grid and
the Grid onto the Life Scientist? The remainder of this paper is an
attempt to structure and give context to a number of possible
solutions.</p>
      <sec id="sec-6-1">
        <title>Easing the transition between research modalities</title>
        <p>Based on the shift of research modalities described in the
previous section, we can define the technological factor of the shift
as one of “desktop to large scale”. An obvious (but not yet
concrete) suggestion is therefore to introduce the technology from the
Desktop-based perspective of the Life Scientist, letting him use
the Grid from his local Desktop computer.</p>
        <p>
          Many attempts have been made to provide Desktop access to the
Grid. The mission, in summary, is to let a scientist use large-scale
distributed computing and storage resources from a familiar
environment. Examples of such initiatives include:
• VBrowser<sup>10</sup>, an attempt to provide access to storage services
on the Grid through an interface that resembles a standard
file browser;
• inQ<sup>11</sup>, a browser tool for the Storage Resource Broker (SRB);
• jGridStart<sup>12</sup>, a Java WebStart application to handle certificate
requests and to import a certificate into your browser
(currently in alpha);
• GridApps<sup>13</sup>, a REST (Representational State Transfer)
          <xref ref-type="bibr" rid="ref1">(Fielding, R.T. 2000)</xref>
          based interface for applications running
on the Grid;
• Leiden Grid Infrastructure<sup>14</sup> (LGI), similar to GridApps.
        </p>
        <p>We believe that such low-level interfaces offer real added value
to Life Scientists. Instead of dealing with the Grid itself, the
scientist can do a familiar task from a familiar environment.
Of course, such interfaces do have major dependencies, and an
update or upgrade of one of the many components can (and will)
break some of them.</p>
        <p>Workflow systems are a different kind of attempt to provide Grid
interfaces. These systems, such as Taverna<sup>15</sup> and Moteur<sup>16</sup>,
allow scientists to define a workflow of Grid jobs. Although these
applications are mostly Desktop-based (or sometimes on-line, such
as the P-GRADE Portal<sup>17</sup>), their principle differs from that of
the preceding tools. They often replace the complexity of the Grid with a
type of complexity that approaches large-scale computing from a
perspective that might be closer to the mindset of the scientist.</p>
        <p>We believe that such platforms provide much added value for
many different use-cases, but question their use for general
application within the Life Sciences. We notice that Life Science
research groups that utilize such software need to go through a
learning curve whose steepness is not proportional to the
increase in productivity it delivers. Even though Life Scientists do
have a need for simple workflows, these workflows are often the
same or very similar across their domain. Therefore it should be
possible to take a simpler, though perhaps less generic, approach.</p>
        <p>An example of such an approach is annotated application
services. Providing access to applications running on the Grid in
the way GridApps does is a good first step. However, after
collecting data, the Life Scientist typically needs to do more than a
single atomic step (as provided by GridApps) to get useful
information from his data. It is common that the Life Scientist
needs the Grid to pre-process, process and visualize or interpret his
data. Using GridApps he can do all of these steps, provided that
the applications he needs are available. It would be more useful,
however, if he could specify all his actions at once. That way he
would not have to deal with moving data, monitoring his
jobs, and so on.</p>
        <p>10 http://staff.science.uva.nl/~ptdeboer/vlet/page_vbrowser.html
11 http://www.sdsc.edu/srb/index.php/InQ
12 http://www.nikhef.nl/pub/projects/grid/gridwiki/index.php/JGridstart
13 https://ws2.grid.sara.nl/apps/
14 http://fwnc7003.leidenuniv.nl/LGI/
15 http://taverna.sourceforge.net/
16 http://modalis.polytech.unice.fr/softwares/moteur/start
17 http://portal.p-grade.hu/</p>
        <p>Providing annotations for application services on the Grid would
be a major step. (See Illustration 4: Tying Application Services.)
These annotations should specify input parameters and the type of
output generated. By checking them it would be possible to see
which application can take the output of another application as its
input. As an additional step, a client application could be
developed that checks for available application services and allows
the user to select which applications to run in which order.</p>
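        <p>The compatibility check described above can be sketched as follows. This is a minimal illustration, assuming each annotation records a single input type and a single output type; the class, the function names and the example services are hypothetical and do not describe GridApps or any existing annotation format.</p>
        <preformat>
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceAnnotation:
    """Hypothetical annotation for one application service."""
    name: str
    input_type: str   # data type the service consumes
    output_type: str  # data type the service produces


def can_follow(first, second):
    """True if the output of `first` can serve as the input of `second`."""
    return first.output_type == second.input_type


def valid_chain(services):
    """True if every service accepts the output of the one before it."""
    return all(can_follow(a, b) for a, b in zip(services, services[1:]))


# Illustrative annotations; the service and type names are made up.
preprocess = ServiceAnnotation("preprocess", "raw-reads", "fasta")
align = ServiceAnnotation("align", "fasta", "alignment")
visualize = ServiceAnnotation("visualize", "alignment", "image")

print(valid_chain([preprocess, align, visualize]))  # True
print(valid_chain([preprocess, visualize]))         # False
```
        </preformat>
        <p>A client application of the kind suggested above could run such a check before submission, so that an incompatible ordering is rejected on the Desktop rather than failing on the Grid.</p>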
        <sec id="sec-6-1-1">
          <title>Illustration 4: Tying Application Services</title>
          <p>Another major issue is that of data management. From his
old-modality frame of mind, the Life Scientist has no conception of the
complexity of data management on the Grid. Concepts like
replication or checksums do not come naturally, since in his Desktop
environment they are irrelevant.</p>
          <p>An important issue regarding data management is its
current unreliability when storing many smaller files at once. This
is due to technical issues (for which we make a suggestion in
the following section), and the available workarounds are hard for
the scientist to handle.</p>
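          <p>One workaround, assumed here for illustration since the text does not name a specific one, is to bundle many small output files into a single archive before storing it on an SE, so that only one file has to be transferred. A minimal sketch using Python's standard tarfile module; the function name and the paths passed to it are hypothetical:</p>
          <preformat>
```python
import tarfile
from pathlib import Path


def bundle_outputs(output_dir, archive_path):
    """Pack every regular file under output_dir into one gzipped tar archive.

    Returns the number of files packed. The archive can then be stored
    as a single file instead of many small ones.
    """
    output_dir = Path(output_dir)
    files = sorted(p for p in output_dir.rglob("*") if p.is_file())
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in files:
            # Store paths relative to output_dir so the archive unpacks cleanly.
            archive.add(path, arcname=str(path.relative_to(output_dir)))
    return len(files)
```
          </preformat>
          <p>The archive should be written outside the directory being packed. Even a step this small means the scientist has to unpack the data again before using it, which illustrates why such workarounds are hard to handle.</p>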
          <p>We believe that the concepts of data management are
complicated, but knowledge of them is essential when working
with the Grid. Therefore, as long as no alternative is available, our
suggestion is to generate more teaching material on how data
management works and how it should be used, specifically by the
Life Scientists.</p>
          <p>Another option is to develop an interface based on common and
familiar protocols, such as WebDAV<sup>18</sup>, while maintaining the
current technologies. The advantage of such protocols is that the
remote file system can be mounted easily, because most operating
systems provide native support. However, implementing
these protocols on top of the existing infrastructure might prove
problematic.</p>
        </sec>
      </sec>
      <sec id="sec-6-2">
        <title>Adapting the Grid to suit the Life Sciences</title>
        <p>Apart from providing interfaces that fit the Life Scientist, much
work can be done to improve technical concepts of the Grid. Two
major issues can be identified:
• Data Management
• Job failure rates</p>
        <p>The technical issues with Data Management on the Grid are well
known and documented. Although the concepts of distributed
storage as used on the Grid have proven successful in
other domains, they are less successful on the Grid, for reasons that
have been discussed in the previous section.</p>
        <p>Based on the identified problems with the SRM protocol, which
in our opinion cause most of the technical issues with
Data Management on the Grid, we propose to implement a
different protocol. During selection the following criteria should be
applied:</p>
        <p>More efficient for smaller files. Since SRM is not able to deal with
many small files, the next protocol should handle them more efficiently.</p>
        <p>Common for enabling access to distributed storage. If we choose a
protocol that is widespread among different domains,
many clients and tools will be available.</p>
        <p>Not academic. By choosing a protocol that goes beyond the
academic perspective, we can define a better framework for
our services. In the end we, as supporters, should provide a
service to our customers, the Life Scientists (among others).
Since non-academic software is much easier to rely on, we would be
able to go as far as defining a Service Level Agreement
(SLA).</p>
        <p>18 http://www.webdav.org/</p>
        <p>The issue of job failure rates asks for a different approach. There
is no widely available alternative with the promise of better
performance.</p>
        <p>Since most of the errors occur between job submission and
the point where the job lands on a Worker Node, effort needs to be spent on
more actively advocating the use of Pilot Jobs. These are Grid jobs
that run on a meta-level; their main purpose is to get onto a Worker
Node and check the environment. After this is done, a pilot job can fetch a
task definition from an external source and start processing this
task. If all goes well, it can store the results of the task, delete the
task definition on the external source, and fetch a new task (also
see Illustration 5: Pilot Jobs).</p>
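        <p>The pilot job cycle described above can be sketched as a simple loop. The example is a minimal illustration under stated assumptions: the external source is replaced by an in-memory stand-in, and all class and function names are hypothetical rather than taken from ToPoS, DIANE or glideinWMS.</p>
        <preformat>
```python
class TaskPool:
    """In-memory stand-in for an external task store (hypothetical)."""

    def __init__(self, tasks):
        self._tasks = dict(tasks)  # task id -> task definition

    def fetch(self):
        """Return (task_id, definition) for some pending task, or None."""
        for task_id, definition in self._tasks.items():
            return task_id, definition
        return None

    def delete(self, task_id):
        """Remove a task definition once its result has been stored."""
        self._tasks.pop(task_id, None)


def environment_ok():
    """Placeholder for the checks a pilot runs on the Worker Node."""
    return True


def run_pilot(pool, process, results):
    """Fetch, process and delete tasks until the pool is empty."""
    if not environment_ok():
        return 0  # leave all tasks in the pool for other pilots
    done = 0
    while True:
        item = pool.fetch()
        if item is None:
            break
        task_id, definition = item
        results[task_id] = process(definition)  # store the result first
        pool.delete(task_id)                    # only then delete the task
        done += 1
    return done


pool = TaskPool({"t1": 2, "t2": 3, "t3": 4})
results = {}
print(run_pilot(pool, lambda n: n * n, results))  # 3
print(results)  # {'t1': 4, 't2': 9, 't3': 16}
```
        </preformat>
        <p>Deleting the task definition only after the result is stored means that a task whose processing fails stays in the pool and can be picked up by another pilot, which is what makes the approach robust against failures on individual Worker Nodes.</p>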
        <sec id="sec-6-2-1">
          <title>Illustration 5: Pilot Jobs</title>
          <p>
            There are some frameworks available that support this concept:
• Token Pool Server<sup>19</sup> (ToPoS). An extremely efficient, open
source, REST-based interface to store task definitions;
• Distributed Analysis Environment<sup>20</sup> (DIANE). Provides
automatic control and scheduling of computations, and is part
of the EGEE RESPECT suite;
• Condor<sup>21</sup> GlideinWMS
            <xref ref-type="bibr" rid="ref2">(Sfiligoi, I. 2007)</xref>
            . Provides a similar
approach, but from a Workload Management System
(WMS) perspective.
          </p>
          <p>19 http://topos.grid.sara.nl/4/
20 http://it-proj-diane.web.cern.ch/it-proj-diane/</p>
        </sec>
      </sec>
      <sec id="sec-6-3">
        <title>Organizational suggestions</title>
        <p>Enabling the Life Science researcher to use the Dutch Grid
infrastructure requires more than technical adjustments and additions.
It is important that the dissemination of information is handled in a
way that comes naturally to the Life Scientist, to provide him with a
platform that has information dedicated to his specific problems,
and to provide clear channels for support. The details of such
organization are expected to be more of an internal issue, and
therefore out of the scope of this discussion. We do give an
overview of a situation that seems a promising improvement.</p>
        <p>We propose to introduce VO-specific support and
documentation, in which the VO manager plays an active and
crucial role. Since a VO is built around a notion of similarity of its
members, we can assume that the Grid is used in similar ways to
achieve similar goals. Because these ways are directly related to
the research, it makes sense to enforce the generation of
documentation on tips and tools. Sharing is the keyword here.</p>
        <p>But what means do we have to enforce this? Since currently all
support is delivered from a central place, we can distribute
responsibilities and introduce a new first-line support, dedicated to
a specific VO. Since a VO does not have a dedicated person for
support, this channel needs to be maintained by the VO members.
The perfect tool for this is a 'many-to-many' mailing list as used
by many successful open-source projects. A searchable archive
needs to be maintained, which doubles as documentation. Of course
there should be a possibility to escalate a question. For this we
suggest continuing to use a ticket system.</p>
        <p>Furthermore, an on-line repository needs to be maintained, in
which static documentation is stored. This repository needs to
include all VO-specific documentation as well as links to generic
documentation. Another useful addition is a searchable database
of public papers created in the context of this VO.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>DISCUSSION</title>
      <p>The Grid has the potential to play an important role in data-centric
Life Science research. There are, however, still many problems to
be solved. These problems are on three distinct levels:
organizational, technological and naivety-related. Currently a solution is
mostly being sought on a technical level. However, it is our
belief that the solution is only to be found by keeping the bigger
picture in mind. We need to build an understanding of the identity of
the Life Scientist and his research, and try to find solutions that fit
his modality of research, mindset and capabilities.</p>
      <p>Of course many technical improvements can be made as well.
An important one, from the Life Scientist's perspective, is Data
Management. The current solution has not proven suitable for the
Life Sciences and effort needs to be put into finding a new
solution. Again, it is important to match this solution to a number
of criteria to make sure it is an actual workable solution.</p>
      <p>On an organizational level it is important to build stronger
Virtual Organizations that have a sense of self-preservation. This
means that knowledge specific to a VO should be contained and
disseminated within the VO itself. To suit this set-up we need to
reorganize our knowledge dissemination and support.</p>
      <p>21 http://www.uscms.org/SoftwareComputing/Grid/WMS/glideinWMS/</p>
      <p>Further discussion and analysis are needed. It is important to be
aware that not only does the Life Scientist need to change to suit the
technology; we as Grid supporters also need to facilitate change to
suit the Life Scientist. Let us move on to find the right balance
between technology push and technology pull, to teach both
ourselves and the Life Scientist, and to find forms of organization
that fit the natural form of the Life Science community.</p>
    </sec>
    <sec id="sec-8">
      <title>ACKNOWLEDGEMENTS</title>
      <p>The observations and corresponding suggestions in this paper were
obtained through, apart from our own experience, many discussions
with colleagues in the e-Science Support Group of the department
of High Performance Computing and Visualization. The paper has
been written with support from, and in the context of, the
International Workshop on Portals for the Life Sciences (IWPLS)
'09.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Fielding</surname>
            ,
            <given-names>R.T.</given-names>
          </string-name>
          (
          <year>2000</year>
          )
          <article-title>Representational State Transfer (REST)</article-title>
          ,
          <source>Architectural Styles and the Design of Network-based Software Architectures</source>
          , University of California, Irvine (CA), pp.
          <fpage>94</fpage>
          -
          <lpage>124</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Sfiligoi</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          (
          <year>2007</year>
          )
          <article-title>glideinWMS - A generic pilot-based Workload Management System</article-title>
          ,
          <source>Journal of Physics: Conference Series</source>
          ,
          <volume>119</volume>
          <fpage>2</fpage>
          -
          <lpage>5</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>