=Paper= {{Paper |id=Vol-1686/LightningTalkPaper16 |storemode=property |title=None |pdfUrl=https://ceur-ws.org/Vol-1686/WSSSPE4_paper_14.pdf |volume=Vol-1686 }} ==None== https://ceur-ws.org/Vol-1686/WSSSPE4_paper_14.pdf
    The BioDynaMo Project: Creating a Platform for
    Large-Scale Reproducible Biological Simulations
            Lukas Breitwieser∗, Roman Bauer†, Alberto Di Meglio∗, Leonard Johard‡,
            Marcus Kaiser†, Marco Manca∗, Manuel Mazzara‡, Fons Rademakers∗, Max Talanov§
            Affiliation: ∗CERN (Switzerland), †Newcastle University (United Kingdom),
            ‡Innopolis University (Russian Federation), §Kazan Federal University (Russian Federation)

                                                   Email: biodynamo-talk@cern.ch


Abstract—Computer simulations have become a very powerful tool for scientific research. In order to facilitate research in computational biology, the BioDynaMo project aims at a general platform for biological computer simulations, which should be executable on hybrid cloud computing systems. This paper describes challenges and lessons learnt during the early stages of the software development process, in the context of implementation issues and the international nature of the collaboration.

This work is licensed under a CC-BY-4.0 license.

I. INTRODUCTION

The BioDynaMo project is a long-term effort in the field of biological simulation to build a scalable and flexible platform. The purpose is to give life scientists access to increasing amounts of computational resources and provide a framework that hides the computational complexity, allows them to focus on their research, and promotes reusability and reproducibility of the results from shared open access data. In order to have an impact on the community, the system must be flexible enough to execute simulations from different specialities with possibly quite distinct requirements.

The project started as a code modernization initiative, inspired by the scientific principles underlying the simulation software Cx3D [1]. Cx3D is a software framework that is able to simulate the development of neural tissue, based on physical mechanisms and neural growth [1]. However, Cx3D cannot leverage cloud computing systems or coprocessors, and so is limited in terms of the simulation size and complexity. Moreover, Cx3D is limited in terms of extensibility and modifiability for other purposes in computational biology.

In general, software modernization is a collective term that subsumes a variety of activities. In our case it means transforming the application from Java to C++ and changing the architecture so that it utilizes the multiple levels of parallelism offered by today's hardware and modern distributed computing models.

II. SOFTWARE DEVELOPMENT PRACTICES

Although Cx3D has a very compact code base (15 kLOC), it is able to perform complex simulations like "cortical lamination". However, the absence of modern software development practices such as automated tests, continuous integration, coding standards compliance, and code reviews hinders a sustainable development process.

Our first step was to introduce development techniques and infrastructure aimed at improving code quality and maintainability, which are essential for our long-term effort. Based on available effort, we opted for testing the whole application rather than writing unit tests for the entire codebase. Existing demo simulations were taken and transformed into test cases. The resulting simulation state is then transformed into JSON format and compared to a ground truth obtained from Cx3D 0.03. A public code repository was created on GitHub [2] and connected to the continuous integration service Travis-CI [3], which automatically checks whether every code change still generates the correct results. This procedure proved to be a good choice given the goal of improving application performance without changing the final output.

Furthermore, a coding style guide was selected to ensure that code is readable, maintainable and follows best practices. A coding standard is only helpful if it is followed by the developers. Thus, tools are needed that help developers conform to these rules. We chose the Google C++ style guide [4], which comes with an Eclipse code formatting definition and cpplint, a tool that checks code for violations. A dedicated "BioDynaMo Developers Guide" [5] introduces new developers to the project, describes conventions beyond the coding style, e.g. usage of the revision system git, and stresses the importance of testing and documentation. External contributions are introduced through GitHub's pull request system and are reviewed before they are merged into our repository. GitHub also offers an issue tracking system that is helpful to report and document software errors and to plan future work packages.

Moreover, communication is an important aspect, especially with project partners based in different countries. Our team uses the messaging system Slack [6] for real-time, low-bandwidth communication, which integrates well with GitHub and Travis-CI. Alternatively, we have set up two mailing lists for asynchronous communication. Additionally, conference calls using Skype and periodic plenary meetings complement our communication toolbox and help us to coordinate this project.

III. MODERNIZING LEGACY CODE: EXAMPLES OF THE METHODOLOGIES APPLIED

High performance and high scalability are the prerequisites to address ambitious research questions like modeling epilepsy. Our efforts in code modernization were driven by the
goal to remove unnecessary overhead and update the software design to tap the unused potential enabled by the paradigm shift to multi- and many-core distributed systems.

The Intel Modern Code Developer Challenge, organized in 2015 with CERN and Newcastle, focused on optimizing sequential C++ brain simulation code provided by Newcastle University in the UK. The contest followed a gamification approach where participating students competed against each other to win an internship at CERN. The ranking was based on the runtime of the provided simulation. Using data layout transformations (array of structures to structure of arrays), parallelization with OpenMP, a custom memory allocator and Intel Cilk Plus array notation, the winner was able to decrease the runtime by a factor of 320. This clearly shows the economic potential of code modernization efforts coupled with gamification and encourages us to repeat the challenge.

Furthermore, we ported the Java code base to C++. This language is better suited for high performance computing as it is compiled to native machine code, removing the overhead of running in a virtual machine, and provides the right ecosystem for parallelization and optimization. The following iterative porting approach has been chosen. First, a Java class is selected and replaced by its C++ translation. In the second step, this C++ class is connected to the remaining Java application. Finally, the Java/C++ hybrid is compiled and used to execute a number of tests. If all tests pass, the developer can proceed with the next iteration by selecting another Java class. Conversely, errors indicated by test failures must have been introduced by code changes since the last iteration. Therefore, this procedure significantly simplifies debugging. Although this approach is associated with additional development overhead in connecting classes in C++ to Java, it gives the benefit of obtaining a runnable system after each iteration. Without that additional effort, the C++ version would only be able to execute tests at the very end, after all classes have been ported. Porting would have been a lot easier if every class or function had sufficient unit tests. In this scenario connecting both languages would no longer be required, since tests could be executed for each function independently.

Testing the whole simulation software had another drawback: floating point differences on diverse systems amplified over many iterations and were responsible for test failures although the code was correct. We fixed that issue using the math library crlibm to obtain reproducible results across different environments, as suggested in [7]. Setting up the whole development environment and porting the application took six months. A preliminary performance benchmark of the single-threaded, non-vectorized C++ version showed promising performance improvements of up to 4.8x with a median of 1.7x.

IV. CONCLUSION

The field of computational biology covers a wide range of scientific topics, each producing many different scientific models, such as those described by [8], [9] and [10]. Hence, a general platform for biological research should be able to meet a number of different requirements. It is crucial that this diversity of the prospective users is already taken into account during the software development process. Incorporating such diversity means that the multidisciplinary project team of BioDynaMo must be able to interact efficiently and make decisions based on the expertise of each team member.

In addition to these more scientifically centered aspects, considerable challenges also arise from a computational/technological point of view. First steps towards such efficient software implementation have been made in the context of the "Intel Modern Code Developer Challenge" competition. Overall, we believe we have created a collaborative foundation for the efficient continuation of the very ambitious software development project of BioDynaMo.

However, considerable challenges remain in the current software development process. The verification and validation of the software is paramount. The recent study of [11] demonstrates the extraordinary risks that arise when the correctness and validity of software tools for scientific research are not properly assessed. Moreover, efficient communication and orchestration among the members are crucial components of this international project. We have identified these key aspects as requiring further effort in parallel to the overall development process.

ACKNOWLEDGMENT

This work was possible thanks to the support by CERN and the CERN openlab Code Modernization program in cooperation with Intel; by the Human Green Brain Project (www.greenbrainproject.org) through the Engineering and Physical Sciences Research Council (EP/K026992/1); by Innopolis University and its Service Science and Engineering lab (SSE); and by SCImPULSE Foundation. The funding institutions had no role in the design of the project, decision to publish, or preparation of the manuscript.

REFERENCES

[1] F. Zubler and R. Douglas, "A framework for modeling the growth and development of neurons and networks," Frontiers in Computational Neuroscience, vol. 3, p. 25, 2009.
[2] "BioDynaMo code repository on GitHub," https://github.com/BioDynaMo/biodynamo.
[3] "Travis CI," https://travis-ci.com/.
[4] "Google C++ style guide," https://google.github.io/styleguide/cppguide.html.
[5] "BioDynaMo developers guide," https://github.com/BioDynaMo/biodynamo/wiki/BioDynaMo-Developers-Guide.
[6] "Slack," https://slack.com/.
[7] E. McIntosh, F. Schmidt, F. de Dinechin et al., "Massive tracking on heterogeneous platforms," in 2006 ICAP Conference in Chamonix, France, 2006.
[8] R. Bauer, F. Zubler, S. Pfister, A. Hauri, M. Pfeiffer, D. R. Muir, and R. J. Douglas, "Developmental self-construction and -configuration of functional neocortical neuronal networks," PLOS Comput Biol, vol. 10, no. 12, 2014.
[9] J. B. Freund, "Numerical simulation of flowing blood cells," Annual Review of Fluid Mechanics, vol. 46, pp. 67–95, 2014.
[10] E. M. Izhikevich and G. M. Edelman, "Large-scale model of mammalian thalamocortical systems," Proceedings of the National Academy of Sciences, vol. 105, no. 9, pp. 3593–3598, 2008.
[11] A. Eklund, T. E. Nichols, and H. Knutsson, "Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates," Proceedings of the National Academy of Sciences, 2016.
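
ILLUSTRATIVE SKETCH: DATA LAYOUT TRANSFORMATION

Section III names a data layout transformation (array of structures to structure of arrays) combined with OpenMP parallelization as one of the techniques used in the challenge. The following is a minimal sketch of that idea under stated assumptions, not the contest code and not BioDynaMo's implementation. The type names CellAoS and CellsSoA, the functions ToSoA and Grow, and the chosen fields are all hypothetical and serve only to show the shape of the transformation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array of structures (AoS): all fields of one cell are adjacent in memory.
// Hypothetical fields, chosen only for illustration.
struct CellAoS {
  double x, y, z;
  double diameter;
};

// Structure of arrays (SoA): each field is stored contiguously for all cells,
// so a loop over one field touches a dense array that is easy to vectorize.
struct CellsSoA {
  std::vector<double> x, y, z;
  std::vector<double> diameter;
};

// Converts the AoS layout into the SoA layout (hypothetical helper).
CellsSoA ToSoA(const std::vector<CellAoS>& cells) {
  CellsSoA soa;
  soa.x.reserve(cells.size());
  soa.y.reserve(cells.size());
  soa.z.reserve(cells.size());
  soa.diameter.reserve(cells.size());
  for (const CellAoS& c : cells) {
    soa.x.push_back(c.x);
    soa.y.push_back(c.y);
    soa.z.push_back(c.z);
    soa.diameter.push_back(c.diameter);
  }
  return soa;
}

// Hypothetical update step: grows every cell diameter by `delta`.
// Iterations are independent, so the loop can be split across threads.
void Grow(CellsSoA* cells, double delta) {
  assert(cells != nullptr);
  const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(cells->diameter.size());
  double* d = cells->diameter.data();
#pragma omp parallel for
  for (std::ptrdiff_t i = 0; i < n; ++i) {
    d[i] += delta;
  }
}
```

When the code is compiled with OpenMP enabled (for example -fopenmp with GCC or Clang), the loop in Grow is distributed across threads. Without that flag the pragma is ignored and the loop runs sequentially with the same result.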