Introduction to a Data-driven Analysis Tool of Molecular Dynamics Self-Assembled Lipid Bilayer Trajectories

Stelios Karozis
Institute of Nuclear & Radiological Sciences and Technology, Energy & Safety, NCSR "Demokritos", Greece

Michael Kainourgiakis
Institute of Nuclear & Radiological Sciences and Technology, Energy & Safety, NCSR "Demokritos", Greece

AINST2020, September 02–04, 2020, Athens, Greece
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT

In-silico studies of the structure and transport properties of lipid bilayers reported so far generally rely on assumptions and approaches that simplify the real system and problem. Nevertheless, the structure and organization of lipid bilayers strongly affect transport coefficients. This is an important observation: simulations are meaningful only when they address realistic structures that mimic the actual lipid phase as closely as possible. In the current study, we present a computational tool that takes the results of Molecular Dynamics (MD) simulations of spontaneously self-assembled lipid bilayers of varying orientation and shape, analyzes the resulting trajectories, and creates a Machine Learning (ML) ready dataset that can be used with a range of ML algorithms, depending on the case. The tool is in the alpha stage, where tests are performed, with a planned public release under a free and open source license.

KEYWORDS

lipid bilayer, Molecular Dynamics, Machine learning

1 INTRODUCTION

As molecular simulations (MS) evolve into a powerful computational tool for studying complex biomolecular systems, and as computational power grows exponentially, the systems under study are becoming more complex. As a result, large numbers of configurations are produced more easily, which reduces the uncertainty of the calculated thermodynamic properties. The main tools derive from statistical mechanics, so the larger the sample, the more accurate the calculation.

On the other hand, the large volume of MS results creates a data processing problem in terms of software and hardware capabilities. The hardware problem can be overcome with modern solutions, such as distributed data processing systems, or with new software implementations that are more efficient on limited hardware infrastructure. Most MS packages incorporate their own post-processing tools or suggest compatible open source software that is sufficient for most cases. MDTraj [1] is used in a range of cases, or as the basis for other processing software such as TTClust [2], which partitions thousands of frames into a limited number of most dissimilar conformations. Other tools, such as TRAVIS (“Trajectory Analyzer and Visualizer”) [3] and pyPcazip [4], are autonomous and were developed for a specific case, and therefore lack generic applicability.

The aforementioned packages, alongside the tools incorporated in MD and Monte Carlo (MC) simulation software, are well established and tested, but they do not solve the problem of processing the large volume of data that MS produces. Machine Learning (ML) algorithms are data analytics tools for cases where no equation or predefined model exists. The goal is to deduce (“learn”) the model from the data. ML may be useful not only for managing and analyzing the big data of MS but also as a new way to study systems and discover patterns that may lead to insights about the case under investigation.

ML has already been used in MS in many different ways, from post-processing to the preparation of input parameters and the reduction of the error of the simulation itself [5–9].

In the current paper, we introduce a computational tool that analyzes randomly oriented lipid bilayers derived from MD trajectories and creates a dataset ready to be used in ML algorithms. The initial data consist of spontaneously self-assembled structures of the lipid bilayer obtained from MD simulations.

2 CAPABILITIES AND IMPLEMENTATION

The workflow under discussion consists of three distinct steps: (1) the analysis of the MD trajectories, (2) the creation of the ML ready dataset and (3) the use of the dataset in ML algorithms (see Figure 1).

Figure 1: Illustration of the workflow process of the presented tool.

The tool is written in the Python 3 programming language and provides a dynamic input interface that can be adapted to the requirements of each use case. The user has to describe the atom groups and the primary analysis for each group. Moreover, the input interface allows the results of the primary analyses to be combined in order to calculate secondary properties of the system. These inputs need to be written in Python dictionary format.

In order to address the problem of differently oriented and shaped lipid bilayers, which result from self-assembly (see Section 3), the tool performs a domain decomposition of the final configuration and identifies the atoms that belong to the user-defined groups. Each group and domain becomes a sub-system that is analyzed as a unique MD system. As such, each MD simulation may create more than one sub-system and hence more than one instance in the final dataset. By breaking the system into small domains, where assumptions such as no curvature and no intersection point can be applied, each conformation is treated as an ideal bilayer structure, and a series of MD analysis tools can be used. The resulting dataset can be used as input to ML algorithms, which enable the identification of patterns and provide insights into large and complex bilayer structures.

The tool can efficiently load trajectory and/or topology data in the format used by the GROMACS [10] MD simulation tool and can use many of the post-processing tools that GROMACS provides, alongside customized calculations (primary or secondary), in order to calculate a series of properties for each sub-system. The structural characteristics calculated for each sub-system are the peaks of the density profile, the tilt of the ordered part of the lipid chain, the peaks of the radial distribution function of pairs of lipid groups and the order parameter of the lipid chains.

3 CASE STUDY

The orientation of the lipid bilayer can be studied through MD calculations. However, such treatments are based on the a-priori placement of the lipid molecules in appropriate positions, in order to form a periodic system with appropriately oriented hydrophilic head groups and hydrophobic chains. Although this construction saves a substantial amount of simulation time, it represents only a simplified and ideal approximation of the equilibrium structure and does not ensure that its structural and dynamical properties correspond to the real lipid phase of the system under study.

Other approaches [11] recreate the structure of the lipid bilayer using MD with random initial configurations of the molecules. This treatment aims to study the dynamics of the system as it moves towards equilibrium and the spontaneous self-assembly of the single lipids into a bilayer, as well as to simulate more realistic conformations of minimum energy. Thus, any approximation based on the a-priori placement of the lipids is eliminated. Because the randomness of the initial configuration affects the resulting structure, a sufficient sample of self-assembled systems needs to be produced (on the order of 10^2). All of the resulting systems are far from ideal in terms of shape and orientation, and their properties are correlated with the local composition, shape, orientation, bilayer thickness etc. The available tools for analyzing MD trajectories lack the functionality to process randomly oriented and shaped bilayer structures.

The tool presented in the current paper attempts to address that problem by decomposing the conformation resulting from each simulation into small sub-domains, calculating structural properties of each sub-domain, such as density profile, order parameter, radial distribution function and tilt of lipid chains, and producing an ML ready dataset on which data-driven techniques, such as classification or clustering, can be applied. The ML techniques are expected to provide a fast, efficient and unbiased way to group the different sub-domains, and to help identify and extract the physical meaning of each resulting group via its properties. That information will lead to a recommendation of a series of distinct and well-defined bilayer structures that exist simultaneously in the macroscopic system. The recommended conformations can be reconstructed and taken into account in future studies of the system.

4 DISCUSSION

The capabilities of the tool serve as a bridge, connecting MD data with structural properties and ML algorithms for general data science audiences. The derived measurements constitute a domain dataset, which aims to feed ML algorithms and (i) explore patterns that may emerge by applying unsupervised learning algorithms or (ii) build a model that predicts a property of interest. Moreover, the calculated properties can be used as supplementary data to a larger dataset. The tool stands out for its novel approach of examining the system as a series of sub-systems, thus overcoming the problems and limitations of analyzing complex lipid bilayer structures.

The tool is currently an alpha version, where tests and debugging are performed. As future work, the outcome and results of a case study are planned to be used, alongside the first public release of the code under a free and open source license. The tool is hosted at: https://mssg.ipta.demokritos.gr/gitlab/skarozis/toobba

ACKNOWLEDGMENTS

This research is co-financed by Greece and the European Union (European Social Fund - ESF) through the Operational Programme «Human Resources Development, Education and Lifelong Learning» in the context of the project "Reinforcement of Postdoctoral Researchers - 2nd Cycle" (MIS-5033021), implemented by the State Scholarships Foundation (IKY).

REFERENCES

[1] R. T. McGibbon, K. A. Beauchamp, M. P. Harrigan, C. Klein, J. M. Swails, C. X. Hernández, C. R. Schwantes, L.-P. Wang, T. J. Lane, and V. S. Pande, “MDTraj: A Modern Open Library for the Analysis of Molecular Dynamics Trajectories,” Biophysical Journal, vol. 109, pp. 1528–1532, Oct. 2015.
[2] T. Tubiana, J.-C. Carvaillo, Y. Boulard, and S. Bressanelli, “TTClust: A Versatile Molecular Simulation Trajectory Clustering Program with Graphical Summaries,” Journal of Chemical Information and Modeling, vol. 58, pp. 2178–2182, Nov. 2018.
[3] M. Brehm, M. Thomas, S. Gehrke, and B. Kirchner, “TRAVIS—A free analyzer for trajectories from molecular simulation,” The Journal of Chemical Physics, vol. 152, p. 164105, Apr. 2020.
[4] A. Shkurti, R. Goni, P. Andrio, E. Breitmoser, I. Bethune, M. Orozco, and C. A. Laughton, “pyPcazip: A PCA-based toolkit for compression and analysis of molecular simulation data,” SoftwareX, vol. 5, pp. 44–50, 2016.
[5] E. Swann, B. Sun, D. M. Cleland, and A. S. Barnard, “Representing molecular and materials data for unsupervised machine learning,” Molecular Simulation, vol. 44, pp. 905–920, Jul. 2018.
[6] B. K. Carpenter, G. S. Ezra, S. C. Farantos, Z. C. Kramer, and S. Wiggins, “Empirical Classification of Trajectory Data: An Opportunity for the Use of Machine Learning in Molecular Dynamics,” The Journal of Physical Chemistry B, vol. 122, pp. 3230–3241, Apr. 2018.
[7] M. Haghighatlari and J. Hachmann, “Advances of machine learning in molecular modeling and simulation,” Current Opinion in Chemical Engineering, vol. 23, pp. 51–57, 2019.
[8] H. Sidky, W. Chen, and A. L. Ferguson, “Machine learning for collective variable discovery and enhanced sampling in biomolecular simulation,” Molecular Physics, vol. 118, p. e1737742, Mar. 2020.
[9] F. Noé, A. Tkatchenko, K.-R. Müller, and C. Clementi, “Machine Learning for Molecular Simulation,” Annual Review of Physical Chemistry, vol. 71, pp. 361–390, Apr. 2020.
[10] M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, and E. Lindahl, “GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers,” SoftwareX, vol. 1-2, pp. 19–25, Sep. 2015.
[11] S. N. Karozis, E. I. Mavroudakis, G. C. Charalambopoulou, and M. E. Kainourgiakis, “Molecular simulations of self-assembled ceramide bilayers: comparison of structural and barrier properties,” Molecular Simulation, vol. 46, pp. 323–331, Mar. 2020.
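
As an illustration of the workflow described in Section 2, the following is a minimal sketch of how a dictionary-based input and a domain decomposition could produce one dataset instance per group and domain. It is not the tool's actual code: the dictionary keys, the group and atom names, the two toy analyses (`count`, `mean_z`), the function names and the regular cubic grid are all assumptions made for illustration, and atom positions are assumed to be already wrapped into an orthorhombic box.

```python
import numpy as np

# Hypothetical input dictionary: group names, keys and atom names are
# illustrative assumptions, not the tool's actual schema.
GROUPS = {
    "HEAD": {"atoms": ["N"], "analysis": ["count", "mean_z"]},
    "TAIL": {"atoms": ["C1"], "analysis": ["count", "mean_z"]},
}

# Hypothetical primary analyses: each maps an (n, 3) coordinate array to a number.
ANALYSES = {
    "count": lambda xyz: int(len(xyz)),
    "mean_z": lambda xyz: float(xyz[:, 2].mean()),
}


def decompose(positions, box, n_cells):
    """Return one flat domain index per atom on an n_cells^3 regular grid."""
    scaled = np.floor(positions / box * n_cells).astype(int)
    cells = np.clip(scaled, 0, n_cells - 1)  # keep atoms on the box edge inside
    return np.ravel_multi_index(tuple(cells.T), (n_cells,) * 3)


def build_dataset(positions, names, box, n_cells, groups):
    """Create one row (dataset instance) per non-empty (domain, group) pair."""
    domain = decompose(positions, box, n_cells)
    rows = []
    for d in np.unique(domain):
        for group_name, spec in groups.items():
            mask = (domain == d) & np.isin(names, spec["atoms"])
            if not mask.any():
                continue  # this group has no atoms in this domain
            row = {"domain": int(d), "group": group_name}
            for key in spec["analysis"]:
                row[key] = ANALYSES[key](positions[mask])
            rows.append(row)
    return rows


# Toy configuration: four atoms in a 10 x 10 x 10 box split into 2^3 domains.
positions = np.array([[1.0, 1.0, 1.0], [1.0, 1.0, 3.0],
                      [6.0, 6.0, 6.0], [6.0, 6.0, 8.0]])
names = np.array(["N", "C1", "N", "C1"])
box = np.array([10.0, 10.0, 10.0])

rows = build_dataset(positions, names, box, n_cells=2, groups=GROUPS)
for row in rows:
    print(row)
```

In this sketch a single configuration yields several rows, which mirrors the statement that one MD simulation may contribute more than one instance to the final dataset. A real analysis would replace the toy functions with the structural properties listed in Section 2, such as density profile peaks or order parameters.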