=Paper= {{Paper |id=Vol-2844/ainst5 |storemode=property |title=Introduction to a Data-driven Analysis Tool of Molecular Dynamics Self-Assembled Lipid Bilayer Trajectories (short paper) |pdfUrl=https://ceur-ws.org/Vol-2844/ainst5.pdf |volume=Vol-2844 |authors=Stelios Karozis,Michael Kainourgiakis |dblpUrl=https://dblp.org/rec/conf/setn/KarozisK20 }} ==Introduction to a Data-driven Analysis Tool of Molecular Dynamics Self-Assembled Lipid Bilayer Trajectories (short paper)== https://ceur-ws.org/Vol-2844/ainst5.pdf
Introduction to a Data-driven Analysis Tool of Molecular Dynamics Self-Assembled Lipid Bilayer Trajectories

Stelios Karozis
Institute of Nuclear & Radiological Sciences and Technology, Energy & Safety, NCSR "Demokritos", Greece

Michael Kainourgiakis
Institute of Nuclear & Radiological Sciences and Technology, Energy & Safety, NCSR "Demokritos", Greece

ABSTRACT
The in-silico studies reported so far on the structure and transport properties of lipid bilayers generally rely on assumptions and approaches that simplify the real system and problem. Nevertheless, the structure and organization of lipid bilayers strongly affect transport coefficients. This is an important observation, showing that simulations can be meaningful only when they address realistic structures that mimic the actual lipid phase system as closely as possible.

In the current study, a computational tool is presented that takes the results of Molecular Dynamics (MD) simulations of spontaneously self-assembled lipid bilayer structures of varying orientation and shape, analyzes the resulting trajectories, and creates a Machine Learning (ML) ready dataset that can be used with a range of ML algorithms, depending on the case. The tool is in the alpha stage, where tests are performed, with a planned public release under a free and open source license.

KEYWORDS
lipid bilayer, Molecular Dynamics, Machine learning

AINST2020, September 02–04, 2020, Athens, Greece
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1    INTRODUCTION
As molecular simulations (MS) continue to evolve into a powerful computational tool for studying complex biomolecular systems, and as computational power grows exponentially, the systems under study are becoming more complex. A large number of configurations can therefore be produced with greater ease, which reduces the uncertainty of the calculated thermodynamic properties. The main tools derive from statistical mechanics, hence the larger the sample becomes, the more accurate the calculation.

On the other hand, the large amount of MS results creates a data processing problem in terms of software and hardware capabilities. The hardware problem can be overcome with modern solutions, such as distributed data processing systems, or by new software implementations that are more efficient on limited hardware infrastructures. Most MS packages incorporate their own post-processing tools or suggest the use of compatible open source software that is sufficient for most cases. MDTraj [1] is used in a range of cases or as a basis for other processing software such as TTClust [2], which partitions thousands of frames into a limited number of most dissimilar conformations. Other tools, like TRAVIS (“Trajectory Analyzer and Visualizer”) [3] and pyPcazip [4], are autonomous and were developed for a specific case, thus lacking generic applicability.

The aforementioned packages, alongside the tools incorporated in MD and Monte Carlo (MC) simulation software, are well established and tested, but they do not solve the problem of processing the big data produced by MS. Machine Learning (ML) algorithms are data analytics tools for cases where no equation or predefined model exists; the goal is to deduce (“learn”) the model from the data. ML may be useful not only for managing and analyzing the big data of MS but also as a new way to study systems and discover patterns that may lead to insights about the case under investigation. ML has already been used in MS in many different ways, from post-processing to the preparation of input parameters and the reduction of the error of the simulation itself [5–9].

In the current paper, we introduce a computational tool for analyzing randomly oriented lipid bilayers derived from MD trajectories and for creating a dataset ready to be used in ML algorithms. The initial data consist of spontaneously self-assembled structures of the lipid bilayer obtained from MD simulations.

2   CAPABILITIES AND IMPLEMENTATION
The workflow under discussion consists of three distinct steps: (1) the analysis of the MD trajectories, (2) the creation of the ML ready dataset and (3) the use of the dataset in ML algorithms (see Figure 1).

The tool is written in the Python 3 programming language and provides a dynamic input interface that can be adapted to the requirements of each use case. The user has to describe the atom groups and the primary analysis for each group. Moreover, the input interface enables the combination of the results of the primary analysis in order to calculate secondary properties for the system. These inputs need to be written in Python dictionary format.

In order to address the problem of lipid bilayers of different orientation and shape, which result from self-assembly (see Section 3), the tool performs a domain decomposition of the final configuration and identifies the atoms that belong to the user-defined groups. Each group and domain becomes a sub-system that is analyzed as a unique MD system. As such, each MD simulation may create more than one sub-system and hence more than one instance in the final dataset. By breaking the system into small domains, where the assumptions of no curvature, no intersection point, etc. can be applied, the conformation is treated as an ideal bilayer structure and a series of MD analysis tools can be used. The resulting dataset can be used as input to ML algorithms.
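
The group definition and domain decomposition described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the tool's actual interface: the dictionary keys, the atom names, the `decompose` function, the cubic grid and the cell size are all hypothetical, chosen to show how atoms of user-defined groups in a final configuration might be binned into sub-systems.

```python
from collections import defaultdict

# Hypothetical input in Python dictionary format: atom groups and the primary
# analysis requested for each group (all names are illustrative assumptions).
groups = {
    "heads": {"atoms": ["N", "P"], "analysis": "density_profile"},
    "tails": {"atoms": ["C1", "C2"], "analysis": "order_parameter"},
}


def decompose(atoms, groups, cell=2.0):
    """Bin atoms into cubic domains of edge length `cell` (assumed to be in nm).

    `atoms` is a list of (name, x, y, z) tuples taken from the final
    configuration. Returns a dict {(group, (i, j, k)): [atom indices]};
    each key stands for one sub-system.
    """
    # Reverse lookup: atom name -> group label.
    name_to_group = {
        name: label for label, spec in groups.items() for name in spec["atoms"]
    }
    subsystems = defaultdict(list)
    for idx, (name, x, y, z) in enumerate(atoms):
        label = name_to_group.get(name)
        if label is None:
            continue  # atom does not belong to any user-defined group
        domain = (int(x // cell), int(y // cell), int(z // cell))
        subsystems[(label, domain)].append(idx)
    return dict(subsystems)


# Toy configuration with made-up coordinates.
atoms = [
    ("N", 0.5, 0.5, 0.5),
    ("P", 1.0, 1.5, 0.2),
    ("C1", 0.7, 0.3, 1.1),
    ("C2", 3.1, 0.4, 0.9),
    ("OW", 0.1, 0.1, 0.1),  # water oxygen, not part of any group
]
subsystems = decompose(atoms, groups)
```

In this sketch every (group, domain) key would be analyzed as an independent MD system, so a single simulation contributes several instances to the final dataset. A regular cubic grid is the simplest possible choice; the actual decomposition scheme used by the tool is not specified here.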

Applying ML algorithms to this dataset enables the identification of patterns and provides insights into large and complex bilayer structures.

The tool can efficiently load trajectory and/or topology data in the format used by the GROMACS [10] MD simulation package and use many of the post-processing tools that GROMACS provides, alongside customized calculations (primary or secondary), in order to calculate a series of properties for each sub-system. The structural characteristics calculated for each sub-system are the peaks of the density profile, the tilt of the ordered part of the lipid chain, the peaks of the radial distribution function of pairs of lipid groups and the order parameter of the lipid chains.

Figure 1: Illustration of the workflow process of the presented tool.

3   CASE STUDY
The orientation of the lipid bilayer can be studied through MD calculations. However, such treatments are based on the a-priori placement of the lipid molecules in appropriate positions, in order to form a periodic system with appropriately oriented hydrophilic groups and hydrophobic chains. Although this a-priori placement saves a substantial amount of simulation time, it represents only a simplified and ideal approximation of the equilibrium formation and does not ensure that its structural and dynamical properties correspond to those of the real lipid phase of the system under study.

Other approaches [11] recreate the structure of the lipid bilayer using MD with random initial configurations of the molecules. This treatment aims to study the dynamics of the system while it moves towards equilibrium and the spontaneous self-assembly of the single lipids into a bilayer, as well as to simulate more realistic conformations of minimum energy. Thus, any approximation based on the a-priori placement of the lipids is eliminated. Because the randomness of the initial configuration affects the resulting structure, a sufficient sample of self-assembled systems needs to be produced (on the order of 10²). All of the resulting systems are far from ideal in terms of shape and orientation, and their properties are correlated with the local composition, shape, orientation, bilayer thickness, etc. The available tools for analyzing MD trajectories lack the functionality to process randomly oriented and shaped bilayer structures. The tool presented in the current paper attempts to address that problem by decomposing the conformation resulting from each simulation into small sub-domains, calculating structural properties of each sub-domain, such as the density profile, order parameter, radial distribution function and tilt of the lipid chains, and producing an ML ready dataset to which data-driven techniques, such as classification or clustering, can be applied. The ML techniques are expected to provide a fast, efficient and unbiased way to group the different sub-domains and to help identify and extract the physical meaning of each resulting group via its properties. That information will lead to a recommendation of a series of distinct and well-defined bilayer structures that exist simultaneously in the macroscopic system. The recommended conformations can be reconstructed and taken into account in future studies of the system.

4    DISCUSSION
The capabilities of the tool serve as a bridge, connecting MD data with structural properties and ML algorithms for general data science audiences. The derived measurements constitute a domain dataset that aims to feed ML algorithms in order to (i) explore patterns that may emerge by applying unsupervised learning algorithms or (ii) build a model that predicts a property of interest. Moreover, the calculated properties can be used as supplementary data to a larger dataset. The tool stands out for its novel approach of examining the system as a series of sub-systems, thus overcoming the problems and limitations of analyzing complex lipid bilayer structures.

The tool is currently an alpha version, where tests and debugging are performed. As future work, the outcome and results of a case study are planned to be used, alongside the first public release of the code under a free and open source license. The tool is hosted at: https://mssg.ipta.demokritos.gr/gitlab/skarozis/toobba

ACKNOWLEDGMENTS
This research is co-financed by Greece and the European Union (European Social Fund - ESF) through the Operational Programme «Human Resources Development, Education and Lifelong Learning» in the context of the project "Reinforcement of Postdoctoral Researchers - 2nd Cycle" (MIS-5033021), implemented by the State Scholarships Foundation (IKY).

REFERENCES
[1] R. T. McGibbon, K. A. Beauchamp, M. P. Harrigan, C. Klein, J. M. Swails, C. X. Hernández, C. R. Schwantes, L.-P. Wang, T. J. Lane, and V. S. Pande, “MDTraj: A Modern Open Library for the Analysis of Molecular Dynamics Trajectories,” Biophysical Journal, vol. 109, pp. 1528–1532, Oct. 2015.
[2] T. Tubiana, J.-C. Carvaillo, Y. Boulard, and S. Bressanelli, “TTClust: A Versatile Molecular Simulation Trajectory Clustering Program with Graphical Summaries,” Journal of Chemical Information and Modeling, vol. 58, pp. 2178–2182, Nov. 2018.
[3] M. Brehm, M. Thomas, S. Gehrke, and B. Kirchner, “TRAVIS—A free analyzer for trajectories from molecular simulation,” The Journal of Chemical Physics, vol. 152, p. 164105, Apr. 2020.
[4] A. Shkurti, R. Goni, P. Andrio, E. Breitmoser, I. Bethune, M. Orozco, and C. A. Laughton, “pyPcazip: A PCA-based toolkit for compression and analysis of molecular simulation data,” SoftwareX, vol. 5, pp. 44–50, 2016.
[5] E. Swann, B. Sun, D. M. Cleland, and A. S. Barnard, “Representing molecular and materials data for unsupervised machine learning,” Molecular Simulation, vol. 44, pp. 905–920, Jul. 2018.
[6] B. K. Carpenter, G. S. Ezra, S. C. Farantos, Z. C. Kramer, and S. Wiggins, “Empirical Classification of Trajectory Data: An Opportunity for the Use of Machine Learning in Molecular Dynamics,” The Journal of Physical Chemistry B, vol. 122, pp. 3230–3241, Apr. 2018.
[7] M. Haghighatlari and J. Hachmann, “Advances of machine learning in molecular modeling and simulation,” Current Opinion in Chemical Engineering, vol. 23, pp. 51–57, 2019.
[8] H. Sidky, W. Chen, and A. L. Ferguson, “Machine learning for collective variable discovery and enhanced sampling in biomolecular simulation,” Molecular Physics, vol. 118, p. e1737742, Mar. 2020.
[9] F. Noé, A. Tkatchenko, K.-R. Müller, and C. Clementi, “Machine Learning for Molecular Simulation,” Annual Review of Physical Chemistry, vol. 71, pp. 361–390, Apr. 2020.
[10] M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, and E. Lindahl, “GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers,” SoftwareX, vol. 1-2, pp. 19–25, Sep. 2015.
[11] S. N. Karozis, E. I. Mavroudakis, G. C. Charalambopoulou, and M. E. Kainourgiakis, “Molecular simulations of self-assembled ceramide bilayers: comparison of structural and barrier properties,” Molecular Simulation, vol. 46, pp. 323–331, Mar. 2020.