=Paper= {{Paper |id=None |storemode=property |title=Low Latencies as Conditione sine qua non for Interactive Data Exploration and Timely Collaboration |pdfUrl=https://ceur-ws.org/Vol-819/paper7.pdf |volume=Vol-819 |dblpUrl=https://dblp.org/rec/conf/iwsg/KrabbenhoftM11 }} ==Low Latencies as Conditione sine qua non for Interactive Data Exploration and Timely Collaboration== https://ceur-ws.org/Vol-819/paper7.pdf
Low latencies as conditione sine qua non
for interactive data exploration and timely collaboration
Hajo N. Krabbenhöft1,*, Steffen Möller2
1 University of Lübeck, Institute for Neuro- and Bioinformatics, Lübeck, Germany
2 University of Lübeck, Department of Dermatology, Lübeck, Germany




ABSTRACT
Motivation: High-throughput technologies, like gene expression arrays and next-generation sequencing, provide enormous data sets, which are too large to transfer or download quickly. The study of such data, which for our application means explaining the measurements with a molecular interpretation of disease etiology, requires continuous updates and refinements as novel interpretations are pursued. The complexity of the problem requires a diverse range of expertise, and thus a shared view is crucial for a successful collaboration - within and between institutions.

Web services and traditional web pages provide centralized data storage and synchronized presentation. Relying on a single central server, though, comes with its own flavor of reliability and performance issues. Every time the server is busy solving a request, the user is forced to wait. It is therefore very beneficial to combine the integrity of web services and the shareability of web pages with the fluency of a desktop application. Increasing the interactivity of data presentation for each individual user allows for a more interactive knowledge exchange on the group scale.

Here, we present a combination of Open Source technologies for distributed, synchronized and failure-resistant storage of huge data sets as the technological foundation for globally fast access to research data. Accordingly, this work explores the derived possibilities for interactive presentation to a group of locally distributed researchers, as enabled by a problem-tailored web application. To aid in the investigative work, the user interface shows minimal latencies. These goals are achieved by capitalizing on related developments in distributed data storage and asynchronous web technologies, most notably the non-relational database Apache Cassandra and the Google Web Toolkit. This combines efficient pre-processing with parallelisation.

The developed web application looks akin to a typical desktop application and is highly responsive, since it downloads needed data in parallel while the user is working. The researcher can prepare different data set views for different aspects of his analysis, which are immediately available for colleagues and collaborators. By underpinning every decision and conference call with a synchronized shared data set, group communication is greatly improved. This work demonstrates that the interactivity for users working on large data sets is strengthened with remote applications, and that typical “show next page” delays are overcome by employing the latest web technologies. This way, the strong server-user interaction allows for the seamless extension to serving additional users and thus allows for collaborations.

Availability: Source code for the web application and the data storage back-end was released under the GNU Lesser General Public License and is freely available for download from http://github.com/fxtentacle/eQTL-GWT-Cassandra

* To whom correspondence should be addressed.



                               Copyright © 2011 for the individual papers by the papers' authors. Copying permitted only for private and academic purposes.
3rd International Workshop on Science Gateways for Life Sciences (IWSG 2011), 8-10 JUNE 2011



researcher is doing investigative work and, therefore, it should be possible to browse all of the data interactively.

1.1 Prior art
The most common approach found in the Bioinformatics community for presenting and browsing data sets is to store research data inside a relational database and write custom-made software or web pages to present the data. Some approaches also include the ability to produce diagrams, but most are limited to text and tabular data. While generated web pages are usually static, this form of presentation can be considered interactive if the user can issue filtering requests to the web server and retrieve a new web page containing the response within an acceptable response time.

Inside a relational database, data are stored sorted by the primary key, and separate look-up tables are generated for column range and equality queries. Since these look-up tables need to be identical for every database server in the system, common database software does not allow write requests to be distributed. There is, however, a range of solutions for replicating databases to multiple servers such that read requests can be distributed. This means that a relational database scales well with huge data sets when it comes to read requests, but does not scale at all with write requests.

For very basic applications where research data are imported only once and then never modified, this works well. However, as soon as data are being processed or annotated on the user’s behalf, the write rate is capped at that of a single server. With traditional databases, only reads can be executed on a replica of the data and, therefore, the system as a whole cannot scale for writes.

We would like to see a word-processor-like working environment: all data editable and inspectable with local views on the full data, helped by search tools and situation-dependent statistics like word counts. We want to avoid data transfer (it takes too long), web-forms-based “paged” interactions and any non-scalable components, and we want to directly support inter-user communication.

2 METHODS
The system evolved in Java from the PHP-implemented TiQS interacting QTL System (tiqs.it). It does not require any local installation work apart from unzipping and executing a shell script. Web service requests are handled by a custom-written Java servlet inside a Jetty 6 (codehaus Foundation) servlet container. Jetty was chosen for its capability to run with the same configuration file on Windows, Mac OS X and Linux. Its direct competitor, Apache Tomcat, instead needs to be set up individually for each machine.

Data describing the topology of the system are stored in a PostgreSQL database (www.postgresql.org) with Hibernate 3 (www.hibernate.org). The valuable scientific data is stored inside a distributed Apache Cassandra (Lakshman and Malik, 2010) database. Load distribution is handled by an nginx load balancer (wiki.nginx.org), which was chosen for its high performance, stability and ease of configuration. Optimized example data set refinements are provided for the example of our expression QTL. There is a plug-in API which allows researchers to write arbitrary filters and data processors in Java.

3 RESULTS
3.1 Minimal technical requirements
From the previous PHP implementation, no code could be saved, and a separate execution environment was designed from scratch. Special care was taken to ensure that the application remains compatible with common IT constraints in research institutions: the application needs HTTP access on a random port for each worker node, as well as two configurable ports on which the peer-to-peer communication of the distributed database will take place.

The system can work with almost any memory and hard disk configuration, and every Windows, Mac OS X and Linux computer can be turned into a worker node simply by copying a folder and running a shell script. The worker node software could also be deployed remotely.

The used database is data-center-aware in its distribution of redundancy. Accordingly, a complete self-contained copy of the data is kept at each physical facility, provided the researcher has appointed one or more machines there to use for data storage. The researcher can immediately start working, even while a local copy of the data set is being synchronized automatically with coworkers and collaborators around the globe. There is no initial waiting time for downloading the complete data set or manually deploying updates. Current Linux distributions like Debian (Möller et al., 2010) have all packages readily available or downloadable from the developers’ websites.

3.2 Reliability and performance
Most current research software does not include any sort of failure tolerance and data replication, which comes as a surprise given the price of good research data. With growing data sets, storing and retrieving the correct subset is not a simple task anymore. A chain is only as strong as its weakest link, and therefore a bad database and schema choice will completely ruin a data exploration software.

Apache Cassandra was chosen because it is stable, fast and replicates data. This database is used in production at Facebook with billions of rows and therefore can be assumed to be reliable. Moving away from conventional relational databases towards a novel distributed system built out of a key-value store and manually managed indexes paid off well.

Database query times average 25 ms and the system has shown a maximum throughput of 10,000 expression QTL entries written per second on a single machine. Evaluating the performance on three machines showed that scalability was achieved and is simple to set up. When the database system is using the same configuration file on every machine, different worker nodes will automatically find each other and relocate the distributed data accordingly. Fault tolerance was evaluated by randomly disconnecting one of the three machines from the network. While the processing time for background tasks did go up when disconnecting an active worker node, the presentation front-end still responded as fast as before.

It was also verified that the data set stays complete and consistent as long as not more than half of the worker nodes are disconnected at the same time.
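This failure behaviour matches quorum-based replication as used by Cassandra: as long as a majority of the replicas of each row stays reachable, every quorum read overlaps the latest quorum write. A minimal sketch of the arithmetic follows; the replica counts are illustrative and not taken from our deployment:

```java
// Quorum arithmetic behind the "fewer than half may fail" guarantee:
// with N replicas, quorum reads and writes each contact floor(N/2)+1
// nodes, so every read quorum intersects every write quorum.
public class QuorumCheck {
    static int quorum(int replicas) {
        return replicas / 2 + 1;
    }

    // The data stays readable and consistent as long as a quorum
    // of replicas is still connected.
    static boolean available(int replicas, int failed) {
        return replicas - failed >= quorum(replicas);
    }

    public static void main(String[] args) {
        int n = 3; // illustrative replication factor
        // read and write quorums always overlap in at least one replica
        assert quorum(n) + quorum(n) > n;
        assert available(n, 1);   // one of three nodes down: still fine
        assert !available(n, 2);  // majority down: no quorum possible
        System.out.println("quorum(" + n + ") = " + quorum(n));
    }
}
```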


This volume is published and copyrighted by its editors.
3rd International Workshop on Science Gateways for Life Sciences (IWSG 2011), 8-10 JUNE 2011



Fig. 1. The web application after loading a data set and selecting a data set layer. The chromosomal gauge view shows an overview of the chromosome, with important expression QTL locations marked by green/yellow blocks. Right alongside the overview, a table is displayed which shows the 25 most important expression QTL inside the area shown by the chromosomal gauge.

Fig. 2. The chromosomal map view. One can easily see which loci are interacting with which genes in the map area (F) as well as their border distribution (E). Note that the researcher has hidden the data set layer selector on the top to give more room for viewing the data. Shown are also the menu bar (A), the tab area for selecting a preferred presentation type (B), the settings for the chromosomal map presentation (C) and the chromosome bands, retrieved from Ensembl using DAS (D). On this zoom level, the scrollbars (G) are grayed out and the contig and transcript DAS tracks are inactive. The user can click any displayed expression QTL or DAS track to get more information. The user can also drag and move the chromosomal map view and zoom in or out using the mouse wheel. This allows the user to dynamically switch between a genome-wide overview and a local detail view depending on the task at hand.

So from a scalability and reliability point of view, this novel approach is superior to any conventional system relying on a single centralized database.

3.3 Example user interface for expression QTL
The technologies described above were applied for interactively presenting high-throughput data in statistical genetics. A web application provides a front-end to expression QTL data in a genomic context provided by DAS (Dowell et al., 2001). This ensures that users can easily share data with each other by sharing their links. By integrating a menu bar and by allowing the web page to be viewed in full screen, the user can interact with the web application akin to a desktop program. Since the whole program is run in the user’s web browser, the user can use any operating system and does not require any prerequisites, except for the aforementioned web browser.

Figure 1 shows the developed web application running inside the web browser Google Chrome on Mac OS X. On the top, one can see the so-called data set layers. When the user invokes a filter or a processing operation, a modified copy of the shown data set is created and prepared for viewing in the background. The researcher can thereby prepare different presentations of his data for different aspects of his analysis. Since data set layers need to be computed only once, as opposed to workflows, for example, these views of the data set are immediately available for colleagues and collaborators.

On the lower half of the screen, one can see the chromosome browser with DAS tracks and annotations for the provided research data. One feature of the novel approach that was received especially well was the ability to change the viewing area in the chromosome browser without reloading the page. The user can click on and drag the chromosome to scroll. While the user is moving the display area using his mouse, the web application downloads the needed data in parallel, so it can update the view while the user is still scrolling. The table view below is also dynamically updated to always show the most relevant data rows, in this case the 25 most probable expression QTL in the specific area of the chromosome which is currently visible in the chromosome browser.
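The scrolling behaviour described above can be sketched in plain Java. In the real application, GWT compiles the client to JavaScript and data arrives via asynchronous RPC; the class and method names below are illustrative stand-ins, not the actual API:

```java
import java.util.Map;
import java.util.concurrent.*;

// Sketch of the prefetching idea: while the user drags the view, the
// client asynchronously requests the blocks for the new viewport and
// repaints as each block arrives, instead of blocking on a page reload.
public class ViewportPrefetch {
    private final ExecutorService pool = Executors.newFixedThreadPool(4);
    final Map<Integer, String> cache = new ConcurrentHashMap<>();

    // stand-in for a network fetch of one data block
    private String fetchBlock(int block) {
        return "rows of block " + block;
    }

    // called whenever the visible range changes; returns immediately,
    // so the user can keep scrolling while data is in flight
    public CompletableFuture<Void> onScroll(int firstBlock, int lastBlock) {
        CompletableFuture<?>[] pending =
            new CompletableFuture<?>[lastBlock - firstBlock + 1];
        for (int b = firstBlock; b <= lastBlock; b++) {
            final int block = b;
            pending[b - firstBlock] = CompletableFuture
                .supplyAsync(() -> fetchBlock(block), pool)
                .thenAccept(rows -> cache.put(block, rows)); // repaint hook
        }
        return CompletableFuture.allOf(pending);
    }

    public int cachedBlocks() {
        return cache.size();
    }

    public void shutdown() {
        pool.shutdown();
    }

    public static void main(String[] args) throws Exception {
        ViewportPrefetch v = new ViewportPrefetch();
        v.onScroll(0, 3).get(); // a real UI thread would not block here
        System.out.println(v.cachedBlocks() + " blocks cached");
        v.shutdown();
    }
}
```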






Fig. 3. Low latency is achieved by employing a traditional application model written in JavaScript to execute inside the user’s web browser. All data is transferred asynchronously with a datacenter-aware multi-layered caching and replication strategy.

The chromosomal map view shown in figure 2 was found to work great for getting a quick overview of which gene is interacting with which loci and, later on, for investigating those locations. To ease the process of following up on these locations, transcripts and known genes retrieved from Ensembl DAS tracks are displayed alongside the expression QTL data when the user zooms in. The columns used for positioning along the X and Y axis can be freely chosen to accommodate different flavors of two-dimensional data.

The web application approach is highly reactive in comparison to normal web pages. That is presumably because most calculations are done when data are written. Hence, the data are stored already prepared, sorted and preformatted, which makes their presentation cheap in terms of network and CPU usage.

By working with a synchronized local copy of the database, read latency is as low as for traditional desktop applications. Each separate user benefits from this increased interactivity, thereby accelerating overall team communication.

3.4 How to obtain reliability, scalability and interactivity
3.4.1. Homogeneous replication of software and data
With exceptions for mobile computing, a local copy is always faster to access than relying on a remote service. However, users are not willing to wait hours or even days for a slow initial download of the whole data set. Therefore, our software system replicates commonly-accessed data automatically while falling back to remote access while replication or synchronization is in progress. This allows new users to immediately start working and, by dynamically creating a local replica of the data set, we ensure that no row needs to be sent twice.

A research collaboration includes a number of computers distributed over several networks. It is safe to assume that on such a scale, at least one machine or network connection will fail. When scaling to work with huge data sets, even more computational power is needed and, with more machines, component failures will become more and more frequent. In fact, we planned for and accepted them as part of the normal operation of a distributed software system.

For the system to stay operational and interactive under such conditions, the data need to be replicated. One simply cannot afford to lose data. Also, no worker node in the system should pose a single point of failure. Therefore, all nodes are running the same software and communicate with each other as equal peers. This approach is in stark contrast to the commonly used pattern of master-slave database replication. In cloud environments, this homogeneous configuration enables us to provide demand-driven load balancing.

3.4.2. Copy on write
Not duplicating read-only data has been common sense in operating system design for decades; however, with the growing amount of data stored in research databases, it is becoming increasingly important for maintaining a high read throughput. While creating a newly aggregated data presentation should preferably be fast, it is a rare event when compared to inspecting the data through already existing presentation views.

A research database should therefore create a modified and filtered copy of the data rather than using complex WHERE clauses and JOINs. If JOINs are unavoidable, most database systems provide a VIEW capability, which gives the developer a warm fuzzy feeling of having thought ahead. Sadly, most VIEWs are not materialized by default. Therefore, the performance of a standard database VIEW is as bad as calling the underlying JOINs and WHERE clauses on every access to any row of the VIEW.

This might seem trivial to state, but if one knows beforehand that a certain VIEW will only be modified sparingly, that VIEW should be manually materialized (CREATE TABLE … SELECT …). This trades a one-time creation overhead in return for significantly increased read throughput on following queries.

Using these two optimization techniques, the computational work can be moved from the presentation towards the storage of the data, which allows for parallelisation and distribution on a compute grid. Keeping data stored the way it is supposed to be presented also ensures that all researchers in the collaboration, independent of their available computing power, can immediately access and work with all presentations of the data.

3.4.3. Distribute work
Research data sets may easily contain thousands of rows. While enriching, annotating or filtering the data set, these rows can be processed independently. By distributing one-time computational tasks, such as the creation of a new presentation of the data, to all machines in the collaboration, everyone can see the result data set faster.

In our case, using a distributed database gives every worker node low-latency access to the whole data set, and so one can actually do workload distribution on a row scale.
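As an illustration of row-scale distribution, the sketch below partitions independent per-row tasks across workers, with a parallel stream standing in for the worker nodes; the class and method names are hypothetical, not part of our released code:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Sketch of row-scale work distribution: each row of a data set layer
// can be enriched independently, so rows are simply partitioned across
// workers. A parallel stream stands in for the pool of worker nodes.
public class RowDistribution {
    // stand-in for an expensive per-row annotation or filter step
    static double annotate(int row) {
        return Math.sqrt(row);
    }

    static List<Double> annotateAll(int rows) {
        return IntStream.range(0, rows)
            .parallel()              // rows are processed independently
            .mapToObj(RowDistribution::annotate)
            .collect(Collectors.toList()); // encounter order is preserved
    }

    public static void main(String[] args) {
        List<Double> result = annotateAll(10_000);
        System.out.println(result.size() + " rows annotated");
    }
}
```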
                                                                           low-latency access on the whole data set and so one can actually





If you’re forced to stick with a conventional relational database system, work distribution of low-level tasks might even be counter-productive, or should at least be done with reasonably-sized blocks of data.

3.4.4. Prioritized non-blocking presentation as a stream of blocks of interest
Tables are the dominant form of presentation for research data. In Bioinformatics, gene locations and interactions play an important role and, therefore, different flavors of genome scales, chromosome browsers and interaction maps have been invented. Especially graphical presentations provide the researcher with a quick overview of his data. The possibility to zoom in and move around in a map of his data closes the gap between an overview of the complete data and a detailed close-up of specific features.

Interactive bars and maps constantly require new data to be shown, as the user is moving around and inspecting different aspects of the experiment. This puts an enormous strain on the back-end database, since it means a constant flow of search queries for aggregating the data to show, and distance comparisons are usually O(N²). It is also important that only the most relevant information for a given zoom level is displayed, to make the resulting graphic not only sufficient, but also succinct.

While updating the data or creating a new presentation view, the location where each item will be visible on the bar or map is usually known beforehand. Similarly, when applying our suggestion of copying the data at write time, the relevance of every item for every zoom level can also be calculated offline.

We therefore propose to divide the possible view area into a hierarchical set of equally-sized blocks, as seen in figure 4. Assuming that display position and relevancy score have already been calculated, the data should look akin to table 1.

Resolution 1:  Location 0 - 1000 | Location 1000 - 2000
Resolution 2:  Location 0 - 500 | Location 500 - 1000 | Location 1000 - 1500 | Location 1500 - 2000
Resolution 3:  …

Fig. 4. Improving display performance by pre-calculating sorted lists of displayed items in order of decreasing relevance.

Table 1. Example data

ID    Position    Score    …
1     394         14
2     112         3
3     113         5

Table 2. Example look-up tables

Resolution 1, Block 1    Resolution 3, Block 1    Resolution 3, Block 2
ID    Score              ID    Score              ID    Score
1     14                 3     5                  1     14
3     5                  2     3
2     3

Relevancy scores are stored alongside in the index table, to allow for dynamic merging of blocks. If the user requests the range 0-500 on resolution 3, we could dynamically merge the tables “Resolution 3, Block 1” and “Resolution 3, Block 2”. Given the hierarchical nature of our block scheme, that request would, of course, be more easily satisfied by using “Resolution 2, Block 1”.

Now, for dynamically filling the visual presentation, items are streamed for the requested zoom level. The streaming follows the order of those tables, presenting the most relevant rows first and filling the image as remaining data arrives. The client application can then choose itself how many items it needs to adequately populate the view and close the connection when enough data was received. This allows our web application, for example, to dynamically adapt the number of displayed expression QTLs to the user’s display resolution.

4 DISCUSSION
Setting aside the parallel database, the major contribution towards a new sense of responsiveness is due to the selective transfer of blocks of information from the server to the user. This is what a local application would also attempt to perform, but what traditional web forms just cannot achieve when they rebuild the page from scratch.

This way, the introduction of JavaScript - well hidden behind Java classes by the Google Web Toolkit - contributed far more than just the usual eye candy. The approach was so much more responsive than traditional PHP-produced tables, showing the same infor-
We now compare these positions to figure 3. On zoom resolution                              mation, that we have not even taken measurements. It was “in-
1, all items are in the first block. The same applies to zoom resolu-                       stant” versus “wait a few seconds”. Also, the approach was found
tion 2. On resolution 3, the item with ID=1 is in the second block,                         to consume less bandwidth. We hence expect this sort of web ap-
while the other two items are in the first block. Now we can create                         plications to be adopted by many public Bioinformatics databases
a look-up table for every block at every resolution, which is easy                          throughout the next years.
given the schema-less nature of our chosen distributed key-value
                                                                                            The technologies described above are used in a series of well
store. Example look-up tables are shown in table 2. Please note
                                                                                            known Internet sites like Facebook or the Google family of web-
that the rows of each table have been sorted by their relevancy
                                                                                            based applications. With these prime examples in mind, and tap-
scores.                                                                                     ping into our experiences gained over this implementation, we




                                      Copyright © 2011 for the individual papers by the papers' authors. Copying permitted only for private and academic purposes.
3rd International Workshop on Science Gateways for Life Sciences (IWSG 2011), 8-10 JUNE 2011
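
The offline table construction and the relevancy-ordered streaming described above can be sketched as follows. This is a minimal illustration in Python, assuming a hypothetical fixed view width and 2^(r−1) equally-sized blocks per resolution r; the actual system keeps these tables in its distributed key-value store and streams them to the GWT client.

```python
import heapq

VIEW_WIDTH = 1000.0  # total width of the view area (an assumption for this sketch)

def block_of(position, resolution):
    """1-based index of the block covering `position`; resolution r has 2**(r-1) blocks."""
    block_width = VIEW_WIDTH / 2 ** (resolution - 1)
    return int(position // block_width) + 1

def build_lookup_tables(items, max_resolution):
    """items: iterable of (item_id, position, relevancy).
    Returns {(resolution, block): [(relevancy, item_id), ...]} with each
    table pre-sorted by descending relevancy, as computed offline at write time."""
    tables = {}
    for item_id, position, relevancy in items:
        for resolution in range(1, max_resolution + 1):
            key = (resolution, block_of(position, resolution))
            tables.setdefault(key, []).append((relevancy, item_id))
    for rows in tables.values():
        rows.sort(reverse=True)
    return tables

def stream_items(tables, resolution, blocks):
    """Merge the requested block tables and yield the most relevant items
    first; the client may stop consuming as soon as its view is populated."""
    runs = [tables.get((resolution, b), []) for b in blocks]
    for relevancy, item_id in heapq.merge(*runs, reverse=True):
        yield item_id
```

Because every table is sorted at write time, serving a request is a cheap merge of already-ordered runs rather than an online search, and the client can close the connection as soon as its view is sufficiently populated.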



The technologies described above are used in a series of well-known Internet sites such as Facebook or the Google family of web-based applications. With these prime examples in mind, and tapping into the experience gained over this implementation, we shall compare other contemporary eQTL infrastructures with what they could achieve if they adopted those technologies.

4.1     XGAP
A related project is the eXtensible Genotype And Phenotype platform (XGAP) (Swertz et al., 2010). The XGAP project aims to provide a flexible and open platform for working with data sets, specifically developed with expression QTL data in mind. It also aims to make collaborative work easy and provides integrated tools for importing and exporting data.

It is noteworthy that the XGAP project shares many design decisions with the web application approach presented here. For example, developing a web page rather than a program gives researchers the freedom to share the interface, and therefore access to the data, without requiring the recipient to have a matching operating system and sufficient processing power available. It is expected that more and more applications and data interfaces in general will be developed as web pages, shifting the burden of installation and configuration from the users towards the software developers.

The second shared approach is that of grid computing and background processing. XGAP supports invoking computationally intensive tasks asynchronously in the background, with the workload distributed on a PBS cluster. The novel approach presented in this paper incorporates asynchronous distributed processing as a core feature and can therefore support load balancing and failure tolerance at a level deeper than XGAP. It is expected that this trend will continue and eventually spread to consumer applications. This is essential in order to reap the full benefits of newer processors, which come with more and more cores, whereas further increasing the clock frequency is getting more and more difficult.

XGAP completely lacks distributed data storage. When its MySQL storage back-end breaks down, the XGAP system will lose all of its data, therefore presenting a single point of failure. Neither is it prepared for parallel data management.

4.2     Gene Network
Another related project is Gene Network (Wang et al., 2003). Gene Network provides follow-up information about genes, loci and gene networks, and its module WebQTL allows the user to upload their own research data for further analysis. Gene Network claims to archive more than 25 years of research data and provides very good coverage of additional information.

Obvious shortcomings of Gene Network are that data transfer is not encrypted using the industry standard HTTPS and that there exists no version which researchers could deploy on-site inside their own firewall. Apart from these security issues, WebQTL allows for a very convenient analysis of small data sets. It provides a plentiful selection of visualization methods, such as box plots, correlation diagrams and even directed graphs.

From a technical point of view, Gene Network is considered to be inferior to both the novel approach presented here and XGAP. Data set presentation in Gene Network is implemented as downloading ready-made images from their web servers. This leaves the user with no further possibility for interaction than changing parameters and waiting for the next image to be downloaded. Since the researcher has no possibility of running Gene Network on their own computational resources, the web servers provided by Gene Network are essentially shared by all users.

In its choice of presentation methods, Gene Network is very similar to the R language for scientific computing. Given Gene Network’s focus on reasonably small data sets, using such a scripting language as computational back-end seems a wise choice. Gene Network is therefore wholeheartedly recommended for analytical and investigative work on classical QTL.

4.3     Extendability of presented concepts
By going new ways in terms of data storage, we combined the low latency of local data storage with the benefits and integrity of a centralized storage server. This technological design decision allowed us to greatly increase the interactivity of our data presentation without forcing the user to download the complete data set beforehand. While there was a strong focus during development on expression QTL, that is, their positioning on the chromosome and their associated genes, the system was designed to be plug-able for a multitude of data processors and data visualisation applications.

The chromosome browser allows arbitrary chromosomes to be displayed along with arbitrary annotation information, as long as the DAS file format is used. Similarly, the map view allows arbitrary positioning measures to be used on the X and Y axes, as long as there is a data processor available to calculate said positions. While theoretically any user could develop such data processors using a simple Java API, it might be beneficial to broaden our showcase of example processors to support additional forms of high-throughput data.

Data replication allows a whole team a consistent shared view of their experiment. A new presentation created by one collaborator is immediately available to the entire team. Driven by the high interactivity between every user and the web application, overall team communication speeds up, and there is a general demand for a deeper integration of social aspects into the data presentation. We envision a future version of our system where researchers can discuss current and past measurements in real time using a special comment and annotation system adapted to work directly on the data presentation.

With computing nodes gradually getting cheaper and more readily available, dynamic grid brokering will replace static worker queues and present us with unprecedented peak amounts of compute power. While grid technologies traditionally suffer from their own transiency, the distributed and homogeneous nature of our proposed system can easily compensate for node failures while still retaining near-perfect performance.

ACKNOWLEDGEMENTS
The authors thank Thomas Martinetz and Saleh Ibrahim for comments and a nice working atmosphere. Lydia Lutter is thanked for her critical reading of the manuscript.

REFERENCES
Codehaus Foundation. Jetty 6 HTTP server. URL http://jetty.codehaus.org/jetty/.
William Cookson, Liming Liang, Goncalo Abecasis, Miriam Moffatt, and Mark Lathrop. Mapping complex disease traits with global gene expression. Nat Rev Genet, 10(3):184–194, 2009.
Robin Dowell, Rodney Jokerst, Allen Day, Sean Eddy, and Lincoln Stein. The distributed annotation system. BMC Bioinformatics, 2(1):7, 2001. ISSN 1471-2105. doi:10.1186/1471-2105-2-7.
David Flanagan. JavaScript: The Definitive Guide. O’Reilly Media, Inc., 2006.
Ewald Geschwinde and Hans-Jürgen Schonig. PostgreSQL Developer’s Handbook.



   Sams, Indianapolis, IN, USA, 2001.
International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature, 431(7011):931–945, Oct 2004.
FA Kolpakov, EA Ananko, GB Kolesov, and NA Kolchanov. GeneNet: a gene network database and its automated visualization. Bioinformatics, 14(6):529–537, 1998.
Avinash Lakshman and Prashant Malik. Cassandra: a decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35–40, 2010.
Steffen Möller, Hajo Nils Krabbenhöft, Andreas Tille, David Paleino, Alan Williams, Katy Wolstencroft, Carole Goble, Richard Holland, Dominique Belhachemi, and Charles Plessy. Community-driven computational biology with Debian Linux. BMC Bioinformatics, 11 Suppl 12:S5, 2010.
M Rosenberg and D Court. Regulatory sequences involved in the promotion and termination of RNA transcription. Annu. Rev. Genet., 13:319–53, 1979.
R Sachidanandam, D Weissman, S C Schmidt, J M Kakol, L D Stein, G Marth, S Sherry, J C Mullikin, B J Mortimore, D L Willey, S E Hunt, C G Cole, P C Coggill, C M Rice, Z Ning, J Rogers, D R Bentley, P Y Kwok, E R Mardis, R T Yeh, B Schultz, L Cook, R Davenport, M Dante, L Fulton, L Hillier, R H Waterston, J D McPherson, B Gilman, S Schaffner, W J Van Etten, D Reich, J Higgins, M J Daly, B Blumenstiel, J Baldwin, N Stange-Thomann, M C Zody, L Linton, E S Lander, D Altshuler, and International SNP Map Working Group. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature, 409(6822):928–33, Feb 2001. doi:10.1038/35057149.
Jay Shendure, Gregory J. Porreca, Nikos B. Reppas, Xiaoxia Lin, John P. McCutcheon, Abraham M. Rosenbaum, Michael D. Wang, Kun Zhang, Robi D. Mitra, and George M. Church. Accurate multiplex polony sequencing of an evolved bacterial genome. Science, 309(5741):1728–1732, 2005.
Morris Swertz, K Joeri Velde, Bruno Tesson, Richard Scheltema, Danny Arends, Gonzalo Vera, Rudi Alberts, Martijn Dijkstra, Paul Schofield, Klaus Schughart, John Hancock, Damian Smedley, Katy Wolstencroft, Carole Goble, Engbert de Brock, Andrew Jones, and Helen Parkinson. XGAP: a uniform and extensible data model and software platform for genotype and phenotype experiments. Genome Biology, 11(3):R27, 2010.
J M Trent, M Bittner, J Zhang, R Wiltshire, M Ray, Y Su, E Gracia, P Meltzer, J De Risi, L Penland, and P Brown. Use of microgenomic technology for analysis of alterations in DNA copy number and gene expression in malignant melanoma. Clin. Exp. Immunol., 107 Suppl 1:33–40, Jan 1997.
Jintao Wang, Robert W Williams, and Kenneth F Manly. WebQTL: web-based complex trait analysis. Neuroinformatics, 1(4):299–308, 2003.
W3C Consortium. SOAP Version 1.2 Part 1: Messaging Framework (Second Edition).
W3C Consortium. Web Services Description Language (WSDL).
Adam Warski. Envers: easy entity auditing. URL http://jboss.org/envers/.
Michael Widenius, David Axmark, and MySQL AB. MySQL Reference Manual. O’Reilly Media, Inc., 1st edition, 2002.

Technical diagrams have been created using the “Architecture by Hand” stencil set by Jonathan Brown.
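
As a closing illustration of the plug-able design from section 4.3, the following sketch shows what a data-processor contract could look like. It is given in Python for brevity, whereas the real system exposes a Java API; every class, method and field name here is a hypothetical assumption rather than the actual interface.

```python
from abc import ABC, abstractmethod

class DataProcessor(ABC):
    """Hypothetical plug-in contract: given a raw measurement record,
    compute the (x, y) position under some positioning measure."""

    @abstractmethod
    def position(self, record):
        """Return an (x, y) tuple for `record`."""

class ChromosomePositionProcessor(DataProcessor):
    """Example: x = base-pair position on the chromosome, y = LOD score."""

    def position(self, record):
        return (record["bp_position"], record["lod"])

def layout(records, processor):
    """The map view would call a registered processor for every record
    to obtain its coordinates before rendering."""
    return [processor.position(r) for r in records]
```

A new kind of high-throughput data would then only require registering another `DataProcessor` implementation, leaving the browser and map view untouched.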