=Paper= {{Paper |id=Vol-513/paper-7 |storemode=property |title=Life Sciences & The Dutch Grid: An Analysis from a Grid Supporter's Perspective |pdfUrl=https://ceur-ws.org/Vol-513/paper07.pdf |volume=Vol-513 |dblpUrl=https://dblp.org/rec/conf/iwsg/Lammerts09 }} ==Life Sciences & The Dutch Grid: An Analysis from a Grid Supporter's Perspective== https://ceur-ws.org/Vol-513/paper07.pdf
IWPLS '09

Life Sciences & The Dutch Grid:
An Analysis from a Grid Supporter's Perspective
Lammerts, E.¹
¹ e-Science Support Group, SARA Computing and Networking Services, Science Park 121, 1098 XG Amsterdam, The Netherlands

© 2009

ABSTRACT
Motivation: Over the past few years extensive effort has been undertaken to enable the Life Science community to adopt the Grid in their research. Although much progress has been made, we observe that the Grid is still far from being utilized to its fullest extent. We do, however, believe that the concepts of Grid have great potential for this specific target group. Therefore it is important to summarize the problems on a conceptual level in order to provide a sense of direction to the ongoing international development efforts.

1    INTRODUCTION TO THE DUTCH LIFE SCIENCE GRID

Illustration 1: Sites in The Netherlands accessible for Dutch Life Science VOs

The part of the Dutch Grid accessible to Dutch Life Science Virtual Organizations (VOs) is divided over eleven different sites at eight different locations (see Illustration 1: Sites in The Netherlands accessible for Dutch Life Science VOs).
   All sites with the 'LSG' prefix have been set up in the context of the Life Science Grid¹ project. These sites, with the exception of LSG-EMC, have 16 (32-bit) cores and 1.5 terabytes of storage. The storage will be upgraded to 20 terabytes in the near future. LSG-EMC has 32 cores and 20 terabytes of storage, and five more sites will be set up in the near future with the same capacity. In addition, all LSG sites will be upgraded to 64-bit architectures. All sites run the gLite 3.1.35 middleware with DPM 3.1.28-0 as the Storage Resource Manager (SRM) protocol implementation for the Storage Elements (SEs).

¹ https://grid.sara.nl/wiki/index.php/Life_Science_Grid
² http://www.sara.nl
³ http://www.nwo.nl/ncf
⁴ http://www.nbic.nl

   The Life Science Grid project is an initiative of SARA Computing and Networking Services² and was commissioned by the Netherlands Computer Facilities Foundation³ (NCF) and the Netherlands Bioinformatics Center⁴ (NBIC) in their capacity as
founding partners in the BigGrid project⁵. SARA, one of the core partners in BigGrid, is responsible for the placement and administration of the clusters, as well as for providing support to their users.

2     DIFFICULTIES EXPERIENCED BY THE DUTCH LIFE SCIENTISTS ON THE GRID
The difficulties that the Life Scientists experience when using the Grid are communicated back to us through incident-based notices. Most of these can be filed under one of three categories:
    • Organizational: how does XXX work, where can I find information about it and through what channel can I request support?
    • Technical: my job has successfully generated output but when I try to retrieve it from the SE it does not seem to exist, what is wrong?
    • Naive: I run hundreds of jobs which stage output files of 1 GB each in the output sandbox, why do almost all of my jobs fail?

2.1      Naivety-based problems
Over the past two decades the modality of performing Life Science has been changing⁶. Research has started to shift away from being hypothesis-driven and towards being driven by bulk data generation and interpretation. This shift is where HPC and HTC come into the picture.
   However, the shift is not complete. Even today, many Life Scientists are struggling with the new perspective. Fundamental concepts of data-centric research have not yet settled in the field. As more financial means become available to stimulate the use of HPC and HTC within the Life Sciences, the traditional and new perspectives are intertwining, but not without the expected difficulties that cause the struggle of the Life Scientist.
   A good example is Bioinformatics. Over the past decades the fields of Biology, Informatics and Mathematics (Statistics and Probability Theory in particular) have begun to overlap (see Illustration 2: The overlap between Biology, Informatics and Mathematics), and as a result the field of Bioinformatics was created. However, we notice a lack of 'pure' Bioinformaticians: not many know the details of an algorithm, how to implement it in software and how to interpret the results. Most scientists are located in the field of what we call BioICT: they know how to use software and interpret results, but have little knowledge of the ins and outs of the algorithm, let alone of its implementation. Hence, the requirements for data-centric Biological research are not completely met. Although BioICT is a logical first step for the scientist, it still requires close cooperation with BioStatistics and Statistical Informatics.

Illustration 2: The overlap between Biology, Informatics and Mathematics

   While the Life Scientist tries to find his way in Computer Science he runs into the problem of dealing with and managing large scale distributed systems. For example, when running software like BLAST⁷ (Basic Local Alignment Search Tool) on a desktop computer, the researcher specifies his input data, runs the program and when it finishes, his output data is available directly. When utilizing the Grid this work flow is less intuitive (from his perspective, at least). “Your job is not being submitted because you have not delegated your proxy”, “You should stage your output data on an SE because the WMS cannot handle that much output data in an output sandbox”, “Your output datasets are too many and too small, and cannot be handled by the SE”, and other similar hints and tips keep recurring when supporting the scientist.
   Since these problems have to do with the transition that is taking place in the field, they are hard to solve. But the answer does not end there. From our role as Computer Scientists and System Engineers we have a responsibility to ease the transition: if not from us, from whom can a scientist learn to handle the computing facilities?
   In practice these problems manifest in the following areas:
    • Using the command line interface
    • Job planning and management
    • Data management

⁵ http://www.biggrid.nl/
⁶ Gusfield, D. (2002) http://webcast.ucdavis.edu/Engineering/2008/ECS124_02/ECS124_4-1-02_L-1.asx
⁷ http://www.ncbi.nlm.nih.gov/BLAST/

2.2    Technology-based problems
Other problems originate from the technology rather than the scientist. The Grid, although continuously maturing, is still not a production infrastructure. Furthermore, it is based on academic software, which is not always as stable as we would like.
   One of the general issues is the trade-off that exists between the scale and the stability of the Grid. The growth of the Grid (in systems as well as people) is parallel to the growth of the surface on which errors can occur. We notice that the occurrence of errors as a result of the scale is not just conceptual; often when an update or upgrade of some component is done, another component, an interface or a client breaks in one way or another.
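As a rough illustration of the trade-off between the scale and the stability of the Grid described in Section 2.2, consider a toy model that is our own assumption rather than a measurement: a job depends on n components, and each of them is independently in a broken state with a small probability p. The chance that at least one of them is broken is then 1 - (1 - p)^n, which grows quickly with n. The value p = 0.01 and the component counts below are purely illustrative.

```python
def chance_of_any_failure(components: int, p_broken: float) -> float:
    """Probability that at least one of `components` independent parts is broken.

    Assumes every part is broken with the same probability `p_broken`,
    independently of the others (a simplification made for illustration).
    """
    return 1.0 - (1.0 - p_broken) ** components


if __name__ == "__main__":
    # p_broken = 0.01 is an assumed, illustrative value, not a measured rate.
    for n in (1, 10, 50, 100):
        chance = chance_of_any_failure(n, 0.01)
        print(f"{n:>3} components: {chance:.1%} chance that something is broken")
```

Even with a small per-component probability, the result rises steadily as components are added, which is the sense in which the error surface grows with the Grid.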
   There is another dimension that requires attention. The major early adopter (if not initiator) of Grid technology was the High Energy Physics (HEP) community. As such it contributed much to the current state of the Grid; it defined its needs ever more clearly over the years and influenced development of several systems that are now common ground on the Grid.
   A common example is data management. The concept of Grid data management in The Netherlands is based on the SRM protocol. An implementation of this protocol is at the heart of an SE (in The Netherlands this implementation is usually either dCache⁸ or Disk Pool Manager⁹ (DPM)). For an overview of a common Data Management solution see Illustration 3: The concept of Storage Elements.

Illustration 3: The concept of Storage Elements

   This set-up is a fairly general one for distributed data storage and scales fairly well. However, the SRM protocol has three downsides from the perspective of the Life Scientist:
First, each transfer has major overhead because of the scope of the protocol. It is known to be very inefficient when transferring many small files, which is a common use-case within the Life Sciences, in contrast to HEP.
Second, the protocol is not common. It has no general purpose clients available and the storage is not locally mountable. Furthermore, it has only one official client interface, which is based on the command line.
Third, the implementations of the protocol are academic software. Experience teaches that they are not as stable as we would like them to be. This is a problem for small communities: while the bigger communities like HEP are able to increase redundancy by replicating their data over multiple countries, the Life Science VOs are much more dependent on the local sites.
   Without going into further detail by comparing the computing needs of HEP and the Life Sciences, it is important to realize that some of the current solutions do not fit the normal Life Science use-case. Awareness of this issue is still young and was raised mainly because the method of technology push, as applied so far, was not successful. The signals coming from the field strongly indicate the need of the scientist for more influence on the state of the Grid, or in other words, the need for a technology pull.
   Most technical problems we see are related to one of the following:
    • Authentication
    • Job management and failure rates
    • Data management

2.3      Organization-based problems
After the two previous, external factors, we need to turn around and look at ourselves. One of the issues causing difficulties lies in the horizontal way in which support and documentation are provided: no distinction is being made between the different types of scientists or VOs. A single type and version of documentation and education is provided for all. However, scientists can differ from each other in many ways, from expertise and frame of reference to substance of research and needs for computing. Although the situation in The Netherlands might differ from that in other countries, it remains crucial that support is provided in a way that fits the scientist and his community.
   Issues with an organizational background are related to one of the following:
    • Starting with the Grid
    • The role of the community or VO

3     MAPPING THE LIFE SCIENCES ONTO THE GRID AND VICE VERSA
What steps can we take to map the Life Scientist onto the Grid and the Grid onto the Life Scientist? The remainder of this paper is an attempt to structure, and give context to, a number of possible solutions.

3.1      Easing the transition between research modalities
Based on the shift of research modalities as mentioned in the previous chapter, we can define the technological factor of the shift as one of “desktop to large scale”. So an obvious (but not concrete) suggestion is to introduce the technology from the Desktop-based perspective of the Life Scientist, letting him use the Grid from his local Desktop computer.

⁸ http://www.dcache.org/
⁹ https://twiki.cern.ch/twiki/bin/view/LCG/DataManagementDocumentation

   Many attempts have been made to provide Desktop access to the Grid. The mission is, summarized, to let a scientist use large scale
distributed computing and storage resources from a familiar environment. Examples of such initiatives include:
     • VBrowser¹⁰, an attempt to provide access to storage services on the Grid through an interface that resembles a standard file-browser;
     • inQ¹¹, a browser tool for SRB;
     • jGridStart¹², a Java WebStart application to handle certificate requests and to import a certificate into your browser (currently in alpha);
     • GridApps¹³, a REST (Representational State Transfer) (Fielding, R.T. 2000) based interface for applications running on the Grid;
     • Leiden Grid Infrastructure¹⁴ (LGI), similar to GridApps.
   We believe that such low level interfaces offer real added value to Life Scientists. They allow the scientist to do a familiar task from a familiar environment, instead of dealing with the Grid itself. Of course, such interfaces do have major dependencies, and an update or upgrade of one of the many components can (and will) break some of them.
   Workflow systems are a different kind of attempt to provide Grid interfaces. These systems, such as Taverna¹⁵ and Moteur¹⁶, allow scientists to define a workflow of Grid jobs. Although these applications are mostly Desktop based (or sometimes on-line, such as the P-GRADE Portal¹⁷), their principle is different from that of the preceding interfaces. They often replace the complexity of the Grid with a type of complexity that approaches large scale computing from a perspective that might be closer to the mindset of the scientist.
   We believe that such platforms provide much added value for many different use-cases, but question their use for general application within the Life Sciences. We notice that Life Science research groups that utilize such software need to go through a learning curve of which the steepness is not proportional to the increase of productivity it delivers. Even though Life Scientists do have a need for simple workflows, these workflows are often the same or very similar across their domain. Therefore it should be possible to take a simpler, but maybe less generic, approach.
   An example of such an approach is annotated application services. Providing access to applications running on the Grid in the way GridApps does is a good first step. However, after collecting data, the Life Scientist typically needs to do more than a single atomic step (as provided by GridApps) to get useful information from his data. It is common that the Life Scientist needs the Grid to pre-process, process and visualize or interpret his data. Using GridApps he can do all of these steps, provided that the applications he needs are available. It would be more useful, however, if he could specify all his actions at once. That way he would not have to deal with moving of data, monitoring of his jobs, etcetera.
   Providing annotations for application services on the Grid would be a major step. (See Illustration 4: Tying Application Services.) These annotations should specify input parameters and the type of output generated. By checking them it would be possible to see which application can take the output of another application as its input. As an additional step, a client application could be developed that checks for available application services and allows the user to select which applications to run in which order.

Illustration 4: Tying Application Services

   Another major issue is that of data management. From his old-modality frame of mind, the Life Scientist has no conception of the complexity of data management on the Grid. Concepts like replication or checksums do not come naturally, since in his Desktop environment they are irrelevant.
   An important issue regarding data management is that of its current unreliability when storing many smaller files at once. This is due to technical issues (for which we will make a suggestion in the following chapter), and the available workarounds are hard for the scientist to handle.
   We believe that the concepts of data management are complicated, but knowledge of them is essential when working with the Grid. Therefore, as long as no alternative is available, our suggestion is to generate more teaching material on how data management works and how it should be used, specifically by the Life Scientists.
   Another option is to develop an interface based on common and familiar protocols, such as WebDAV¹⁸, while maintaining the current technologies. The advantage of such protocols is that the remote file system can be mounted easily, because most operating systems provide native support. However, the implementations of these protocols on top of the existing infrastructure might prove problematic.

3.2       Adapting the Grid to suit the Life Sciences
Apart from providing interfaces that fit the Life Scientist, much work can be done to improve technical concepts of the Grid. Two major issues can be identified:
     • Data Management
     • Job failure rates
   The technical issues with Data Management on the Grid are well known and documented. Although the concepts of distributed storage as used on the Grid have proven successful in other domains, on the Grid they are so to a lesser degree, for reasons that have been discussed in the previous chapter.

¹⁰ http://staff.science.uva.nl/~ptdeboer/vlet/page_vbrowser.html
¹¹ http://www.sdsc.edu/srb/index.php/InQ
¹² http://www.nikhef.nl/pub/projects/grid/gridwiki/index.php/JGridstart
¹³ https://ws2.grid.sara.nl/apps/
¹⁴ http://fwnc7003.leidenuniv.nl/LGI/
¹⁵ http://taverna.sourceforge.net/
¹⁶ http://modalis.polytech.unice.fr/softwares/moteur/start
¹⁷ http://portal.p-grade.hu/
¹⁸ http://www.webdav.org/

   Based on the identified problems with the SRM protocol, which in our opinion are the cause for most of the technical issues with
Data Management on the Grid, we propose to implement a different protocol. During selection the following criteria should be maintained:
More efficient for smaller files. Since SRM is not able to deal with many small files, the next protocol should be more efficient.
Common for enabling access to distributed storage. If we choose a protocol that is widespread among different domains, that means many clients and tools are available.
Not academic. By choosing a protocol that goes beyond the academic perspective, we can define a better framework for our services. In the end, we as supporters should provide a service to our customers, the Life Scientists (among others). Since non-academic software is much easier to rely on, we are able to go as far as defining a Service Level Agreement (SLA).
   The issue of job failure rates asks for a different approach. There is no widely available alternative with the promise of better performance.
   Since most of the errors occur between job submission and the point at which the job lands on a Worker Node, effort needs to be spent on more actively advocating the use of Pilot Jobs. These are Grid jobs that run on a meta-level; their main purpose is to get onto a Worker Node and check the environment. After this is done, a Pilot Job can fetch a task definition from an external source and start processing this task. If all goes well, it can store the results of the task, delete the task definition on the external source, and fetch a new task (also see Illustration 5: Pilot Jobs).

Illustration 5: Pilot Jobs

   There are some frameworks available that support this concept.
     • Token Pool Server¹⁹ (ToPoS). An extremely efficient, open source, REST based interface to store task definitions;
     • Distributed Analysis Environment²⁰ (DIANE). Provides automatic control and scheduling of computations, and is part of the EGEE RESPECT suite;
     • Condor²¹ GlideinWMS (Sfiligoi, I. 2007). Provides a similar approach, but from a Workload Management System (WMS) perspective.

¹⁹ http://topos.grid.sara.nl/4/
²⁰ http://it-proj-diane.web.cern.ch/it-proj-diane/
²¹ http://www.uscms.org/SoftwareComputing/Grid/WMS/glideinWMS/

3.3       Organizational suggestions
Enabling the Life Science researcher to use the Dutch Grid infrastructure requires more than technical adjustments and additions. It is important that the dissemination of information is handled in a way that comes naturally to the Life Scientist, to provide him with a platform that has information dedicated to his specific problems, and to provide clear channels for support. The details of such organization are expected to be more of an internal issue, and therefore out of the scope of this discussion. We do give an overview of a situation that seems a promising improvement.
   We propose to introduce VO specific support and documentation, in which the VO manager plays an active and crucial role. Since a VO is built around a notion of similarity of its members, we can assume that the Grid is used in similar ways to achieve similar goals. Because these ways are directly related to the research, it makes sense to enforce the generation of documentation on tips and tools. Sharing is the keyword here.
   But what means do we have to enforce this? Since currently all support is delivered from a central place, we can distribute responsibilities and introduce a new first-line support, dedicated to a specific VO. Since a VO does not have a dedicated person for support, this channel needs to be maintained by the VO members. The perfect tool for this is a 'many-to-many' mailing list as applied by many successful open-source projects. A searchable archive needs to be maintained, serving as documentation. Of course there should be a possibility to escalate a question. For this we suggest to keep on using a ticket system.
   Furthermore, an on-line repository needs to be maintained, in which static documentation is stored. This repository needs to include all VO specific documentation as well as links to generic documentation. Another useful addition is a searchable database of public papers, created in the context of this VO.

4       DISCUSSION
The Grid has the potential to play an important role in data-centric Life Science research. There are, however, still many problems to be solved. These problems are on three distinct levels: organization, technology and naivety. Currently a solution is mostly being looked for on a technical level. However, it is our belief that the solution is only to be found by keeping the bigger picture in mind. We need to build a realization about the identity of the Life Scientist and his research, and try to find solutions that fit his modality of research, mindset and capabilities.
   Of course many technical improvements can be made as well. An important one, from the Life Scientist's perspective, is Data Management. The current solution has not proven suitable for the Life Sciences and effort needs to be put into finding a new solution. Again, it is important to match this solution to a number of criteria to make sure it is an actual workable solution.
   On an organizational level it is important to build stronger Virtual Organizations that have a sense of self-preservation. This means that knowledge specific to a VO should be contained and disseminated within the VO itself. To suit this set-up we need to reorganize our knowledge dissemination and support.
   Further discussion and analysis are needed. It is important to be aware that not only does the Life Scientist need to change to suit the technology; we as Grid supporters also need to facilitate change to suit the Life Scientist. Let us move on to find the right balance between technology push and technology pull, to teach both ourselves and the Life Scientist and to find forms of organization that fit the natural form of the Life Science community.
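To make the Pilot Job proposal of Section 3.2 more tangible, the cycle described there (check the environment on the Worker Node, fetch a task definition from an external source, process it, store the result, delete the task definition, fetch the next one) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the interface of ToPoS, DIANE or glideinWMS: `TaskPool` is a hypothetical in-memory stand-in for the external task source, and `run_pilot`, `process`, `results` and `environment_ok` are names invented for the illustration.

```python
from typing import Callable, Dict, Optional, Tuple


class TaskPool:
    """Hypothetical in-memory stand-in for an external pool of task definitions."""

    def __init__(self, tasks: Dict[str, str]) -> None:
        self._tasks = dict(tasks)

    def fetch(self) -> Optional[Tuple[str, str]]:
        """Return one (task_id, definition) pair, or None when the pool is empty."""
        return next(iter(self._tasks.items()), None)

    def delete(self, task_id: str) -> None:
        """Remove a task definition once its result has been stored."""
        self._tasks.pop(task_id, None)


def run_pilot(
    pool: TaskPool,
    process: Callable[[str], str],
    results: Dict[str, str],
    environment_ok: Callable[[], bool] = lambda: True,
) -> int:
    """Run the pilot cycle and return the number of tasks completed."""
    if not environment_ok():
        return 0  # unsuitable Worker Node: stop before touching any task
    completed = 0
    while True:
        task = pool.fetch()
        if task is None:
            break  # nothing left to do
        task_id, definition = task
        try:
            output = process(definition)
        except Exception:
            break  # leave the task definition in the pool for another pilot
        results[task_id] = output  # store the result first ...
        pool.delete(task_id)       # ... and only then delete the definition
        completed += 1
    return completed
```

The ordering is the point of the sketch: a task definition is deleted only after its result has been stored, so a pilot that lands on a broken Worker Node or fails halfway loses no work. A real task pool would also have to keep two pilots from working on the same task at the same time; that is left out here.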

ACKNOWLEDGEMENTS
The observations and corresponding suggestions in this paper were
obtained through, apart from our own experiences, many discussions
with colleagues in the e-Science Support Group of the department
of High Performance Computing and Visualization. The paper has
been written with support from, and in the context of, the
International Workshop on Portals for the Life Sciences (IWPLS)
'09.

REFERENCES
Fielding, R.T. (2000) Representational State Transfer (REST), in: Architectural Styles
     and the Design of Network-based Software Architectures, University of
     California, Irvine (CA), pp. 94-124
Sfiligoi, I. (2007) glideinWMS: A generic pilot-based Workload Management System,
     Journal of Physics: Conference Series, 119 2-5



