=Paper=
{{Paper
|id=Vol-2844/ainst5
|storemode=property
|title=Introduction to a Data-driven Analysis Tool of Molecular Dynamics Self-Assembled Lipid Bilayer Trajectories (short paper)
|pdfUrl=https://ceur-ws.org/Vol-2844/ainst5.pdf
|volume=Vol-2844
|authors=Stelios Karozis,Michael Kainourgiakis
|dblpUrl=https://dblp.org/rec/conf/setn/KarozisK20
}}
==Introduction to a Data-driven Analysis Tool of Molecular Dynamics Self-Assembled Lipid Bilayer Trajectories (short paper)==
Introduction to a Data-driven Analysis Tool of Molecular Dynamics Self-Assembled Lipid Bilayer Trajectories
Stelios Karozis, Michael Kainourgiakis
Institute of Nuclear & Radiological Sciences and Technology, Energy & Safety, NCSR "Demokritos", Greece
ABSTRACT
The in-silico studies reported so far on the representation of the structure and the evaluation of the transport properties of lipid bilayers are in general based on assumptions and approaches that simplify the real system and problem. Nevertheless, the structure and organization of lipid bilayers strongly affect transport coefficients. This is an important observation, showing that simulations can be meaningful only when they address realistic structures that mimic the actual lipid phase system as elaborately as possible. In the current study, a computational tool is presented that uses the results of Molecular Dynamics (MD) simulations of spontaneously self-assembled lipid bilayers with different orientations and shapes, in order to analyze the resulting trajectories and create a Machine Learning (ML) ready dataset that can be used with a series of ML algorithms, depending on the case. The development of the tool is in the alpha stage, where tests are performed, with a planned public release under a free and open source license.

KEYWORDS
lipid bilayer, Molecular Dynamics, Machine learning

1 INTRODUCTION
As molecular simulations (MS) continue to evolve into a powerful computational tool for studying complex biomolecular systems, and with the exponential growth of computational power, the systems under study are becoming more complex. As such, a large number of configurations is produced with more ease, which reduces the uncertainty of the calculated thermodynamic properties. The main tools derive from statistical mechanics, hence the larger the sample becomes, the more accurate the calculation.
On the other hand, the large amount of MS results creates a data processing problem in terms of software and hardware capabilities. The hardware problem can be overcome with modern solutions, such as distributed data processing systems, or by new software implementations that are more efficient on limited hardware infrastructures. Most MS packages incorporate their own post-processing tools or suggest the use of compatible open source software that is sufficient for most cases. MDTraj [1] is used in a range of cases or as a basis for other processing software like TTClust [2], which partitions thousands of frames into a limited number of most dissimilar conformations. Other tools, like TRAVIS (“Trajectory Analyzer and Visualizer”) [3] and pyPcazip [4], are autonomous and were developed for a specific case, thus lacking generic applicability.
The aforementioned packages, alongside the tools incorporated in MD and Monte Carlo (MC) simulation software, are well established and tested, but they do not solve the problem of processing the large volume of data produced by MS. Machine Learning (ML) algorithms are data analytics tools for cases where no equation or predefined model exists. The goal is to deduce (“learn”) the model from the data. ML may be useful not only for managing and analyzing the big data of MS but also as a new way to study systems and discover patterns that may lead to insights about the case under investigation. ML has already been used in MS in many different ways, from post-processing to the preparation of input parameters and the error reduction of the simulation itself [5–9].
In the current paper, we introduce a computational tool for analyzing randomly oriented lipid bilayers derived from MD trajectories and creating a dataset ready to be used in ML algorithms. The initial data consist of spontaneously self-assembled structures of the lipid bilayer obtained from MD simulations.

2 CAPABILITIES AND IMPLEMENTATION
The workflow under discussion consists of three distinct steps: (1) the analysis of the MD trajectories, (2) the creation of the ML-ready dataset and (3) the use of the dataset in ML algorithms (see Figure 1).
The tool is written in the Python3 programming language and provides a dynamic input interface that is capable of meeting the requirements of each use case. The user has to describe the atom groups and the primary analysis for each group. Moreover, the input interface enables the combination of the results of the primary analysis in order to calculate secondary properties for the system. The aforementioned inputs need to be written in Python dictionary format.
In order to address the problem of differently oriented and shaped lipid bilayers, which are the result of self-assembly (see Section 3), the tool performs a domain decomposition of the final configuration and identifies the atoms that belong to the user-defined groups. Each group and domain becomes a sub-system that will be analyzed as a unique MD system. As such, each MD simulation may create more than one sub-system and hence more than one instance in the final dataset.
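The decomposition step described above can be illustrated with a minimal sketch. It assumes a regular grid over the simulation box, which is only one possible scheme; the text does not specify how the tool actually builds its domains. The names `groups`, `decompose` and `build_subsystems`, as well as the dictionary keys and atom indices, are hypothetical and are not taken from the tool itself.

```python
import numpy as np

# Hypothetical group definition in Python dictionary form. The tool's real
# input keys are not given in the text, so these names are assumptions.
groups = {
    "heads": [0, 1, 4],  # atom indices of an assumed head-group selection
    "tails": [2, 3, 5],  # atom indices of an assumed chain selection
}

def decompose(coords, box, n_cells):
    """Assign each atom to a cell of a regular grid; returns one domain id per atom."""
    coords = np.asarray(coords, dtype=float)
    box = np.asarray(box, dtype=float)
    n_cells = np.asarray(n_cells, dtype=int)
    # Wrap coordinates into the periodic box, then bin them along each axis.
    frac = np.mod(coords, box) / box
    idx = np.minimum((frac * n_cells).astype(int), n_cells - 1)
    # Flatten the (ix, iy, iz) triple of every atom into a single domain id.
    return np.ravel_multi_index(tuple(idx.T), tuple(int(n) for n in n_cells))

def build_subsystems(coords, box, n_cells, groups):
    """Map (group name, domain id) to the atom indices of that sub-system."""
    domain = decompose(coords, box, n_cells)
    subsystems = {}
    for name, atoms in groups.items():
        for atom in atoms:
            subsystems.setdefault((name, int(domain[atom])), []).append(atom)
    return subsystems

# Toy configuration: six atoms in a 4 x 4 x 4 box split into two domains along x.
coords = [[0.5, 0.5, 0.5], [0.6, 0.4, 0.5], [0.5, 0.5, 1.5],
          [3.5, 0.5, 0.5], [3.4, 0.6, 0.5], [3.5, 0.5, 1.5]]
subsystems = build_subsystems(coords, box=[4.0, 4.0, 4.0],
                              n_cells=[2, 1, 1], groups=groups)
```

Each key of `subsystems` corresponds to one candidate instance of the final dataset, so a single configuration with two groups and two domains yields up to four instances in this sketch.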
AINST2020, September 02–04, 2020, Athens, Greece
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
By breaking the system into small domains, where the assumptions of no curvature, no intersection points, etc. can be applied, each conformation is treated as an ideal bilayer structure, and a series of MD analysis tools can be used. The resulting dataset can be used
as input to ML algorithms, which enables the identification of patterns and provides insights into large and complex bilayer structures.
The tool can efficiently load trajectory and/or topology data in the format used by the GROMACS [10] MD simulation package and use many of the post-processing tools that GROMACS provides, alongside customized calculations (primary or secondary), in order to calculate a series of properties for each sub-system. The structural characteristics calculated for each sub-system are the peaks of the density profile, the tilt of the ordered part of the lipid chain, the peaks of the radial distribution function of pairs of lipid groups and the order parameter of the lipid chains.
Figure 1: Illustration of the workflow process of the presented tool.

3 CASE STUDY
The orientation of the lipid bilayer can be studied through MD calculations. However, such treatments are based on the a-priori placement of the lipid molecules in appropriate positions, in order to form a periodic system with appropriately oriented hydrophilic groups and hydrophobic chains. Although this placement saves a substantial amount of simulation time, it represents only a simplified and ideal approximation of the formation in equilibrium and does not ensure that its structural and dynamical properties accurately reproduce the real/natural lipid phase of the system under study.
Other approaches [11] recreate the structure of the lipid bilayer using MD with random initial configurations of the molecules. This treatment aims to study the dynamics of the system while it moves towards equilibrium and the spontaneous self-assembly of the single lipids into a bilayer, as well as to simulate more realistic conformations of minimum energy. Thus, any approximation based on the a-priori placement of the lipids is eliminated. Because the randomness of the initial configuration affects the resulting structure, a sufficient sample of self-assembled systems needs to be produced (on the order of 10^2). All of the resulting systems are far from ideal in terms of shape and orientation, and the properties are correlated with the local composition, shape, orientation, bilayer thickness, etc. The available tools for analyzing MD trajectories lack the functionality to process randomly oriented and shaped bilayer structures. The tool presented in the current paper attempts to address that problem by decomposing each conformation resulting from a simulation into small sub-domains, calculating structural properties of each sub-domain, such as the density profile, order parameter, radial distribution function and tilt of the lipid chains, and producing an ML-ready dataset to which data-driven techniques, such as classification or clustering, can be applied. The ML techniques will provide a fast, efficient and unbiased way to group the different sub-domains and will try to identify and extract the physical meaning of each resulting group via its properties. That information will lead to a recommendation of a series of distinct and well-defined bilayer structures that exist simultaneously in the macroscopic system. The recommended conformations can be reconstructed and taken into account in future studies of the system.

4 DISCUSSION
The capabilities of the tool serve as a bridge, connecting MD data with structural properties and ML algorithms for general data science audiences. The derived measurements constitute a domain dataset that is intended to feed ML algorithms in order to (i) explore patterns that may emerge by applying unsupervised learning algorithms or (ii) build a model that predicts a property of interest. Moreover, the calculated properties can be used as supplementary data for a larger dataset. The tool stands out for its novel approach of examining the system as a series of sub-systems, thus overcoming the problems and limitations of analyzing complex lipid bilayer structures.
The tool is currently an alpha version, on which tests and debugging are performed. As future work, the outcome and results of a case study are planned to be used, alongside the first public release of the code under a free and open source license. The tool is hosted at: https://mssg.ipta.demokritos.gr/gitlab/skarozis/toobba

ACKNOWLEDGMENTS
This research is co-financed by Greece and the European Union (European Social Fund - ESF) through the Operational Programme «Human Resources Development, Education and Lifelong Learning» in the context of the project "Reinforcement of Postdoctoral Researchers - 2nd Cycle" (MIS-5033021), implemented by the State Scholarships Foundation (IKY).

REFERENCES
[1] R. T. McGibbon, K. A. Beauchamp, M. P. Harrigan, C. Klein, J. M. Swails, C. X. Hernández, C. R. Schwantes, L.-P. Wang, T. J. Lane, and V. S. Pande, “MDTraj: A Modern Open Library for the Analysis of Molecular Dynamics Trajectories,” Biophysical Journal, vol. 109, pp. 1528–1532, Oct. 2015.
[2] T. Tubiana, J.-C. Carvaillo, Y. Boulard, and S. Bressanelli, “TTClust: A Versatile Molecular Simulation Trajectory Clustering Program with Graphical Summaries,” Journal of Chemical Information and Modeling, vol. 58, pp. 2178–2182, Nov. 2018.
[3] M. Brehm, M. Thomas, S. Gehrke, and B. Kirchner, “TRAVIS—A free analyzer for trajectories from molecular simulation,” The Journal of Chemical Physics, vol. 152, p. 164105, Apr. 2020.
[4] A. Shkurti, R. Goni, P. Andrio, E. Breitmoser, I. Bethune, M. Orozco, and C. A. Laughton, “pyPcazip: A PCA-based toolkit for compression and analysis of molecular simulation data,” SoftwareX, vol. 5, pp. 44–50, 2016.
[5] E. Swann, B. Sun, D. M. Cleland, and A. S. Barnard, “Representing molecular and materials data for unsupervised machine learning,” Molecular Simulation, vol. 44,
pp. 905–920, Jul. 2018.
[6] B. K. Carpenter, G. S. Ezra, S. C. Farantos, Z. C. Kramer, and S. Wiggins, “Empirical Classification of Trajectory Data: An Opportunity for the Use of Machine Learning in Molecular Dynamics,” The Journal of Physical Chemistry B, vol. 122, pp. 3230–3241, Apr. 2018.
[7] M. Haghighatlari and J. Hachmann, “Advances of machine learning in molecular modeling and simulation,” Current Opinion in Chemical Engineering, vol. 23, pp. 51–57, 2019.
[8] H. Sidky, W. Chen, and A. L. Ferguson, “Machine learning for collective variable discovery and enhanced sampling in biomolecular simulation,” Molecular Physics, vol. 118, p. e1737742, Mar. 2020.
[9] F. Noé, A. Tkatchenko, K.-R. Müller, and C. Clementi, “Machine Learning for Molecular Simulation,” Annual Review of Physical Chemistry, vol. 71, pp. 361–390, Apr. 2020.
[10] M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, and E. Lindahl, “GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers,” SoftwareX, vol. 1–2, pp. 19–25, Sep. 2015.
[11] S. N. Karozis, E. I. Mavroudakis, G. C. Charalambopoulou, and M. E. Kainourgiakis, “Molecular simulations of self-assembled ceramide bilayers: comparison of structural and barrier properties,” Molecular Simulation, vol. 46, pp. 323–331, Mar. 2020.