<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>On Distributed R Computations over BOINC</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alexander Rumyantsev</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anna Eparskaya</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Enrico Blanzieri</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Valter Cavecchia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CNR-IMEM Via alla Cascata 56/C</institution>
          ,
          <addr-line>38123 Trento</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>DISI, Department of Information Engineering and Computer Science University of Trento</institution>
          ,
          <addr-line>Via Sommarive 9, 38123 Povo (TN)</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Institute of Applied Mathematical Research, Karelian Research Centre of RAS</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Lenina Pr.</institution>
          ,
          <addr-line>Petrozavodsk, 185910</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Petrozavodsk State University</institution>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>Pushkinskaya Str.</institution>
          ,
          <addr-line>Petrozavodsk, 185910</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <fpage>108</fpage>
      <lpage>113</lpage>
      <abstract>
        <p>We discuss the technologies for executing R language applications in high-performance and distributed computing environments. The concept of running R applications in BOINC infrastructure is presented. Possible applications that might benefit from the distributed computing environment available in R are discussed.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>High-Performance and Distributed Computing in R Language</title>
      <p>In this section we summarize the methods for parallel and distributed execution of an R script or command. We
refer the reader to the extensive review of R parallel and distributed computing at
https://cran.r-project.org/web/views/HighPerformanceComputing.html, whereas here we provide an alternative
grouping of packages, based on the level of abstraction.</p>
      <p>First, we note that code vectorization is one of the basic concepts of the R language, and many basic functions
accept vector arguments. Besides speeding up computations, vectorization encourages a specific usage
pattern based on the apply-type functions (applying a function to a vector/array of arguments) as an alternative
to for loops. Accordingly, one of the expected ways of organizing implicit parallelism in R is a background
parallelization of the apply function. This programming style is more natural for applied scientists, who prefer
to focus on their application rather than on code optimization. Alternatively, we expect that a parallel programming
professional who uses R as one of the steps in an analysis would use (at various abstraction levels) wrappers
of parallel programming instruments, such as MPI, OpenMP, CUDA, OpenCL etc., which we refer to below
as explicit parallelism. There are also some distinct promising concepts of parallel code execution, such as
the futures concept (packages future and future.BatchJobs) for running independent jobs on parallel and
distributed systems, or the ddR package family for distributed data analysis; however, we focus below on the
first two approaches.</p>
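      <p>As a minimal illustration of the usage pattern described above, the following base R sketch computes the same
result with a for loop, with vectorized arithmetic, and with an apply-type call. The variable names are chosen for
illustration only.</p>
      <preformat><![CDATA[
```r
# Three equivalent ways to square the numbers 1..10 in base R.
x <- 1:10

# 1. Explicit for loop with a preallocated result vector.
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

# 2. Vectorized arithmetic: the operator acts on the whole vector at once.
squares_vec <- x^2

# 3. Apply-type call: a function is mapped over the elements.
squares_apply <- vapply(x, function(v) v^2, numeric(1))

# All three forms give the same values.
stopifnot(isTRUE(all.equal(squares_loop, squares_vec)),
          isTRUE(all.equal(squares_vec, squares_apply)))
```
]]></preformat>
      <p>The third form makes the independence of the per-element calls explicit, which is the property that parallel
apply-type functions rely on.</p>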
      <p>The explicit parallelism model requires low-level programming and expertise in the high-performance and/or
distributed computing field, but yields more computationally efficient code. Traditionally, writing an
explicitly parallel application in R required the following steps:
1. choosing and configuring a parallel/distributed backend (including some configuration work outside the R
environment);
2. writing the code with parallel execution in mind, including data distribution and result aggregation,
self-identification of the process in the Single Instruction Multiple Data model, data transfer by means of MPI,
sockets etc.;
3. taking care of load balancing, interaction with schedulers and cluster management software (if any), as well
as fault tolerance.</p>
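      <p>The three steps above can be sketched with the socket cluster interface of the parallel package. This is only a
minimal local example under simplifying assumptions: two workers on the same machine, a toy summation task, and
no scheduler or fault handling. The variable scale_factor is introduced purely for illustration.</p>
      <preformat><![CDATA[
```r
library(parallel)

# Step 1: choose and configure a backend (here: 2 local socket workers).
cl <- makeCluster(2)

# Step 2: distribute the data and the code explicitly.
scale_factor <- 10
clusterExport(cl, "scale_factor", envir = environment())  # copy to every worker
chunks <- splitIndices(100, length(cl))   # split indices 1..100 into 2 chunks

partial_sums <- clusterApply(cl, chunks, function(idx) {
  sum(idx) * scale_factor                 # each worker handles one chunk
})

# ... and aggregate the partial results on the master.
total <- sum(unlist(partial_sums))

# Step 3: resource management is also left to the programmer.
stopCluster(cl)

total
```
]]></preformat>
      <p>Even in this small case the programmer has to export variables, split the data and release the workers by hand,
which is the overhead that the higher-level packages discussed below try to hide.</p>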
      <p>This workflow has recently been added to the core functionality of the R language by means of the parallel package
(which is distributed as an essential package with the R core). It inherits the shared-memory multicore computing
features (formerly provided by the multicore package) and the multiserver computing features on a Beowulf-type
cluster, known as Simple Network of Workstations, with socket-based communications (the functionality provided
by the deprecated snow package). When the application requires intensive data transfer between computing nodes,
the Rmpi package is used as a wrapper for the MPI API. Finally, the parallel package provides the mclapply
command, which performs parallel execution of an apply-type function.</p>
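      <p>A minimal sketch of the mclapply command follows. The Monte Carlo task and the function estimate_pi are
illustrative assumptions, not part of any package; the number of cores is fixed to two for the example.</p>
      <preformat><![CDATA[
```r
library(parallel)

# A toy CPU-bound task: estimate pi by Monte Carlo with n random points.
estimate_pi <- function(n) {
  x <- runif(n)
  y <- runif(n)
  4 * mean(x^2 + y^2 <= 1)
}

# Forking is unavailable on Windows, where mclapply() requires mc.cores = 1.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# "L'Ecuyer-CMRG" gives each forked child its own random number stream.
RNGkind("L'Ecuyer-CMRG")
set.seed(42)

# Same call shape as lapply(), but the list elements are processed in
# forked child processes.
estimates <- mclapply(rep(1e5, 4), estimate_pi, mc.cores = n_cores)

pi_hat <- mean(unlist(estimates))
pi_hat
```
]]></preformat>
      <p>Replacing mclapply with lapply gives a sequential version with the same structure, which illustrates why
apply-type functions are a convenient entry point for parallel execution.</p>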
      <p>A higher-level abstraction of the explicit parallelism model is provided by the combination of the foreach package
and the do* package family (including doParallel, doMC, doMPI, doSNOW etc.) built over the aforementioned
parallel package. This combination is based on the concept of iterators (resembling for loops with
asynchronous execution of the loop body and without an explicit loop counter), and supports not only data-parallel
processing, but also work with streaming data. The do* packages set up the backend, including
high-performance clusters, whereas the foreach package with its special %dopar% operator
parallelizes the execution of the loop body. Note that the code style imposed by this package combination resembles
C++ programming more than R scripting. In this regard, even more performance-oriented and
fine-grained parallel programming can be done with special package wrappers for parallel programming tools
outside the R environment, including RcppParallel and RcppBlaze. The snowfall package implements
the master–worker parallel execution scheme over an MPI cluster, at the cost of unavoidable configuration
work, including LAM/MPI configuration. Similar functionality is provided by the Rhpc package (with a
programming style resembling that of an MPI-based parallel application in C++/Fortran). The multidplyr
package, acting as a backend for the dplyr package and based on the functionality of parallel, is more focused on
applied analysis, but still requires some explicit work such as data spreading and cluster initialization.</p>
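      <p>The division of labour between the do* packages and foreach can be sketched as follows, assuming that the
foreach and doParallel packages are installed and that two local workers are sufficient for the illustration.</p>
      <preformat><![CDATA[
```r
library(foreach)
library(doParallel)

# The do* package registers the backend: here 2 local workers.
cl <- parallel::makeCluster(2)
registerDoParallel(cl)

# foreach() iterates over i; %dopar% sends each loop body to a worker.
# .combine = c merges the per-iteration values into one vector.
roots <- foreach(i = 1:8, .combine = c) %dopar% {
  sqrt(i)
}

parallel::stopCluster(cl)

roots
```
]]></preformat>
      <p>Switching to another backend only changes the registration lines, while the loop itself stays untouched.</p>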
      <p>High-performance computing on GP-GPUs and co-processors is widely used for specific types of parallel
applications. Following this trend, a number of packages are available for the R language, which either are interfaces
to a specific language (e.g. OpenCL, RCUDA, RViennaCL), or implement specific algorithms in various research
fields, such as data mining (gputools), statistics (cudaBayesreg), and bioinformatics (permGPU). There are
also some supplementary packages dealing with data structures on the GPU and data transfer between the
GPU/co-processor and the host machine (gmatrix, gpuR).</p>
      <p>Distributed computing with distributed data structures is a natural way of working with big datasets.
Several R packages implement explicit distributed computing methods, e.g. datadr, fileplyr, DSL, startR, or the
web application oriented package distcomp; some of them rely on widely used backends such as Hadoop.
We note, however, that the aforementioned packages for distributed computing do not directly fit the
Volunteer Computing model. Instead, they are designed to work in an infrastructure and environment controlled
by the code writer (e.g. they require remote access to machine memory). Among the several attempts to create a
package for a Volunteer Computing backend (including the archived GridR package), to the best of our knowledge,
there is no actively maintained package.</p>
      <p>We conclude this section by discussing several packages for implicit parallelization. These packages mostly
provide an apply-type function for parallel execution of batch jobs over an existing parallel
infrastructure operated by a scheduler (e.g. LSF, SGE, SLURM), or over a distributed system (e.g. Docker Swarm). The
rslurm package allows seamless operation with a SLURM-based high-performance cluster, transparently
performing task distribution and result gathering in an asynchronous way. The packages clustermq and
batchtools, instead, interact in a common way with a large variety of high-performance schedulers,
including SLURM, TORQUE, Docker Swarm etc. While clustermq allows finer tuning of the system through
manual initialization of the master-worker scheme, batchtools offers a more transparent and user-friendly
way to set up parallel execution and wait for the reduction of results.</p>
      <p>To summarize the section, we note that the focus of package developers in the field of parallel and distributed
computing has shifted from low-level explicit parallelism to higher-level implicit methods. The recently
introduced packages interact with a given infrastructure (including schedulers etc.) rather than
creating their own. Currently, there are no actively maintained packages for Volunteer Computing infrastructure
backends in the R language.</p>
    </sec>
    <sec id="sec-3">
      <title>RBOINC Concept</title>
      <p>In this section we discuss the concept of the RBOINC extension package, which makes it possible to utilize the
idle-time computational power of volunteer computers. We note that the BOINC software [And04] is a sophisticated tool for organizing
both Volunteer Computing projects and enterprise-level desktop grids [Iva15].</p>
      <p>Following the conclusions in Section 2, we define the following general requirements for the RBOINC package:
• seamless integration of a BOINC project as a desktop grid backend,
• transparent split and merge operations on the parameters array,
• an apply-type function for asynchronous, fault-tolerant and reliable retrieval of the results of computations.
We note that rslurm is the package with the most similar functionality and architecture. Inspired by
it, we recommend that an umbrella BOINC project for running R code be used as the BOINC backend for the
package. We assume that the essential settings of the project (including the level of redundancy for a workunit,
the deadline for computing etc.) will be set up at the workunit level, with the possibility of passing specific
settings to the BOINC server by means of the package functionality.</p>
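      <p>The split and merge requirements listed above can be illustrated with a purely local base R sketch. All function
names (split_params, run_workunit, boinc_apply_sketch) are hypothetical and do not belong to any existing package,
and the local lapply call only stands in for the submission of workunits to a BOINC server.</p>
      <preformat><![CDATA[
```r
# Hypothetical helper: cut a parameter data frame into workunit-sized chunks.
split_params <- function(params, chunk_size) {
  chunk_id <- ceiling(seq_len(nrow(params)) / chunk_size)
  split(params, chunk_id)
}

# Hypothetical helper: run f on every row of one chunk (one "workunit").
run_workunit <- function(f, chunk) {
  lapply(seq_len(nrow(chunk)), function(i) {
    do.call(f, as.list(chunk[i, , drop = FALSE]))
  })
}

# Hypothetical apply-type front end. A real backend would submit each chunk
# to a BOINC server; lapply() is only a local stand-in here.
boinc_apply_sketch <- function(f, params, chunk_size = 2) {
  chunks  <- split_params(params, chunk_size)
  results <- lapply(chunks, function(chunk) run_workunit(f, chunk))
  unlist(results, recursive = FALSE, use.names = FALSE)   # merge step
}

# One row of the data frame is one set of arguments for f.
params <- data.frame(mu = c(0, 1, 2, 3, 4), sigma = c(1, 1, 2, 2, 3))
out <- boinc_apply_sketch(function(mu, sigma) mu + 2 * sigma, params)
unlist(out)
```
]]></preformat>
      <p>The caller only supplies a function and a table of parameters, so the chunk size and the way chunks are executed
can change without affecting user code.</p>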
      <p>We assume that the apply-type function will perform a seamless split of the parameter space, workunit
generation and initialization of computations. Similarly to the rslurm package, we assume that the results of
computations will be obtained asynchronously, with the possibility of obtaining intermediate (incomplete) results, as well
as of monitoring the current progress level. The architecture of the concept is presented in Fig. 1.</p>
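      <p>One possible way to expose intermediate results and progress is sketched below in base R. It assumes that every
finished workunit leaves one result file in a job directory; this file layout and the helper names finish_workunit
and collect_results are illustrative assumptions rather than a description of BOINC or of an existing package.</p>
      <preformat><![CDATA[
```r
# Assumption: each finished workunit leaves a file "result_<id>.rds" in a
# job directory, as a stand-in for results downloaded from a BOINC server.
job_dir <- tempfile("rboinc_job_")
dir.create(job_dir)
n_workunits <- 4

# Hypothetical helper: store the result of one finished workunit.
finish_workunit <- function(id, value) {
  saveRDS(value, file.path(job_dir, sprintf("result_%d.rds", id)))
}

# Hypothetical helper: collect whatever is available without blocking.
collect_results <- function(job_dir, n_workunits) {
  ids  <- seq_len(n_workunits)
  path <- file.path(job_dir, sprintf("result_%d.rds", ids))
  done <- file.exists(path)
  list(
    progress = mean(done),                       # fraction finished
    results  = setNames(lapply(path[done], readRDS), ids[done])
  )
}

# Two of the four workunits have finished so far.
finish_workunit(1, 10)
finish_workunit(3, 30)
partial <- collect_results(job_dir, n_workunits)

partial$progress        # 0.5
names(partial$results)  # "1" "3"

unlink(job_dir, recursive = TRUE)
```
]]></preformat>
      <p>Because the collection step never waits for missing files, it can be called repeatedly while the remaining
workunits are still being computed.</p>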
      <p>We note some technical difficulties that the package might have to deal with. First, the R software should be
installed on the volunteer’s host. In this regard, we will consider creating virtual machines or, alternatively,
setting up a portable R installation. Fault tolerance issues should be taken into account; however, the BOINC platform
has internal methods to deal with faults, deadline violations and malicious activity. Note also that a BOINC
application should use the checkpointing mechanism to suspend its activity when the volunteer’s host activity
rises above a given threshold. Implementing checkpointing is therefore essential. We also
plan to utilize the built-in multicore features provided by the parallel package.</p>
      <p>In this section we consider the task of gene network expansion as addressed by the gene@home project, and how
the project would have benefited, and could benefit in the future, from the availability of the RBOINC package.</p>
      <p>Biological knowledge about the regulation of genes is in general incomplete. Without detailing the
whole set of events that connect the transcription abundance of two genes, a gene regulatory network represents
the causal relationships between gene transcription levels by means of connections (facilitatory or inhibitory).
A correct description of these relationships allows either predicting the behaviour of the system or manipulating it. Given an incomplete
gene regulatory network, the task of Gene Network Expansion (GNE in the following) is to provide a list of other
genes (possibly equipped with their connections to the genes of the input network) whose order should reflect,
for each gene, the confidence that it is actually part of the correct gene regulatory network. The task emerges
from the observation that, in life science practice, the biologist has, or experimentally gains, some notion of which
genes may be relevant for the process under study. Expanding this initial knowledge with other
genes can suggest relevant genes and pathways for further scrutiny and testing, leading to an increase in
biological knowledge.</p>
      <p>In order to address the GNE task, a Trento-based group of researchers (including two of the authors) devised,
with the help of several collaborators, the gene@home project (http://gene.disi.unitn.it) running on the
BOINC platform. In particular, the platform hosted the implementation of an algorithm called Network
Expansion by Stratified Subsetting and Ranking Aggregation (NES2RA) [AMC+16], which is based on stratified
subsetting of the variables that are fed to the PC-algorithm [SG91] (PC in the following). PC is named after
the first names of the authors who proposed it. It infers causal relationships between variables by means
of Conditional Independence (CI) tests, which makes it possible to discover direct causal relationships
between correlated variables using observational data. PC has been applied to a wide range of domains, including
yeast gene expression data [M+10] and QSAR/QSPR analyses [SGLP16]. So far, the gene@home project has
systematically applied NES2RA to Escherichia coli and Vitis vinifera data.</p>
      <p>The gene@home developers carried out many activities to set up the application. At the very beginning they
tested the PC algorithm, whose ’skeleton’ part is used in the iterated version of PC (called PC-IM) aimed at
GNE and distributed via BOINC, see Fig. 2. NES2RA can be considered a parameter sweep of PC-IM and
is computationally demanding: for example, it requires ∼260K PC-algorithm runs to expand a network of
Vitis vinifera (∼28K genes). The tests of the PC algorithm were done using the implementation available in the
R pcalg package (Methods for Graphical Models and Causal Inference), which proved to be impractical for the use
planned by the researchers. The computational power required for processing several loosely-coupled parallel
subtasks coming from NES2RA led them to rewrite the ’skeleton’ procedure in C++ and to use BOINC with a
very fast application written with the BOINC API. They also improved the code by removing recursion.
The benchmarks showed that the speedup they achieved was ∼240x, together with reduced memory requirements (see
[ASM+15]). They also planned to ’push back’ their code into R, which would give the R user community
a much more efficient version of the original algorithm; however, this activity is currently postponed to the near
future.</p>
      <p>Had the RBOINC package been available to the gene@home team at the very beginning, the effort to set up and
evaluate the approach would have been considerably smaller. Moreover, RBOINC would have provided a convenient
way to test competing methods, for example the network inference methods in R packages, using BOINC without
the need to develop specific BOINC applications. The need to rewrite the PC-algorithm would have been
impossible to avoid, but the new version would now be integrated in R for general use. Finally, making
NES2RA generally available to other users would now be rather straightforward within the R environment.</p>
      <p>Once the RBOINC package is ready, we plan to use it for improving several aspects of the gene@home
project. The main benefit is that our approach to GNE will be available in an integrated analysis environment.
This will permit us to use BOINC for tasks such as ranking aggregation in NES2RA, postprocessing
or even generation of data subsets, which now run only on servers and, in some cases, have proved to be the bottleneck
of the application. As mentioned above, the availability issue would be solved, and the general user would
also have the possibility to integrate the GNE results with comparisons and tests in a familiar and flexible
environment.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusions and Future Work</title>
      <p>The paper presented the concept of the RBOINC package, whose goal is to allow the deployment of computations
over desktop grids running under the BOINC platform. This functionality should be seamlessly available within
the R environment. We reviewed the related R packages and sketched the main requirements of the new package.
Moreover, we considered the potential impact on the gene@home project devoted to gene network expansion.
Gene@home is a BOINC project that implements a specific data analysis procedure; more generally, the RBOINC
package will permit running several different, potentially demanding analyses on the distributed volunteer grid.
Future work is required to actually develop and implement the package. With the growing quantity
of data produced nowadays, the RBOINC package can play a key role not only in solving the GNE task but also in
tackling several other problems that require intense computation on extensive resources.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>The work of AR is partially supported by RFBR, projects 15-07-02341, 15-07-02354, 15-29-07974, 16-07-00622,
and by President RF’s grant No. MK-1641.2017.1.</p>
      <p>[Iva15] Evgeny Ivashko. Enterprise desktop grids. In Proceedings of the Second International
Conference BOINC-based High Performance Computing: Fundamental Research and Development
(BOINC:FAST 2015), pages 16–21. CEUR Workshop Proceedings, Vol-1502, 2015.</p>
      <p>[M+10] M. H. Maathuis et al. Predicting causal effects in large-scale systems from observational data. Nat.
Methods, 7(4):247–248, Apr 2010.</p>
      <p>[R C17] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for
Statistical Computing, Vienna, Austria, 2017.</p>
      <p>[SG91] P. Spirtes and C. Glymour. An algorithm for fast recovery of sparse causal graphs. Social Science
Computer Review, 9:62–72, 1991.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [AMC+16]
          <string-name>
            <given-names>Francesco</given-names>
            <surname>Asnicar</surname>
          </string-name>
          , Luca Masera, Emanuela Coller, Caterina Gallo, Nadir Sella, Thomas Tolio, Paolo Morettin, Luca Erculiani, Francesca Galante, Stanislau Semeniuta, Giulia Malacarne, Kristof Engelen, Andrea Argentini, Valter Cavecchia, Claudio Moser, and Enrico Blanzieri.
          <article-title>NES2RA: Network expansion by stratified variable subsetting and ranking aggregation</article-title>
          .
          <source>The International Journal of High Performance Computing Applications</source>
          ,
          <volume>0</volume>
          (
          <issue>0</issue>
          ):
          <fpage>1094342016662508</fpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [And04]
          <string-name>
            <given-names>David P.</given-names>
            <surname>Anderson</surname>
          </string-name>
          .
          <article-title>BOINC: A system for public-resource computing and storage</article-title>
          .
          <source>In 5th IEEE/ACM International Workshop on Grid Computing</source>
          , pages
          <fpage>4</fpage>
          -
          <lpage>10</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [ASM+15]
          <string-name>
            <given-names>Francesco</given-names>
            <surname>Asnicar</surname>
          </string-name>
          , Nadir Sella, Luca Masera, Paolo Morettin, Thomas Tolio, Stanislau Semeniuta, Claudio Moser, Enrico Blanzieri, and
          <string-name>
            <given-names>Valter</given-names>
            <surname>Cavecchia</surname>
          </string-name>
          .
          <article-title>TN-Grid and gene@home project: Volunteer computing for bioinformatics</article-title>
          .
          <source>In Second International Conference BOINC-based High Performance Computing: Fundamental Research and Development (BOINC:FAST 2015)</source>
          , number
          <volume>1502</volume>
          , pages
          <fpage>1</fpage>
          -
          <lpage>15</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [SGLP16]
          <string-name>
            <given-names>Natalia</given-names>
            <surname>Sizochenko</surname>
          </string-name>
          , Agnieszka Gajewicz, Jerzy Leszczynski, and
          <string-name>
            <given-names>Tomasz</given-names>
            <surname>Puzyn</surname>
          </string-name>
          .
          <article-title>Causation or only correlation? application of causal inference graphs for evaluating causality in nano-QSAR models</article-title>
          .
          <source>Nanoscale</source>
          ,
          <volume>8</volume>
          (
          <issue>13</issue>
          ):
          <fpage>7203</fpage>
          -
          <lpage>7208</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>