<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Introduction to a Data-driven Analysis Tool of Molecular Dynamics Self-Assembled Lipid Bilayer Trajectories</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Stelios Karozis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Kainourgiakis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Nuclear &amp; Radiological Sciences and Technology, Energy &amp; Safety, NCSR "Demokritos"</institution>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The in-silico studies reported so far on the structure and transport properties of lipid bilayers generally rely on assumptions and approaches that simplify the real system. However, the structure and organization of a lipid bilayer strongly affect its transport coefficients. This is an important observation: simulations are meaningful only when they address realistic structures that mimic the actual lipid phase as closely as possible. In the current study, we present a computational tool that takes the results of Molecular Dynamics (MD) simulations of spontaneously self-assembled lipid bilayers of varying orientation and shape, analyzes the resulting trajectories, and creates a Machine Learning (ML) ready dataset that can be used with a range of ML algorithms, depending on the case. The tool is in the alpha stage, in which tests are performed, and a public release under a free and open source license is planned.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 INTRODUCTION</title>
      <p>As molecular simulations (MS) continue to evolve into a powerful
computational tool for studying complex biomolecular systems,
and as available computational power grows exponentially, the systems
under study are becoming more complex. A large number of
configurations can now be produced with greater ease, which reduces
the uncertainty of the calculated thermodynamic properties. The
main tools derive from statistical mechanics; hence, the larger the
sample, the more accurate the calculation.</p>
      <p>
        On the other hand, the large volume of MS results creates a data
processing problem in terms of software and hardware capabilities.
The hardware problem can be overcome with modern solutions,
such as distributed data processing systems, or with new software
implementations that are more efficient on limited hardware
infrastructures. Most MS packages incorporate their own
post-processing tools or suggest the use of compatible open source
software that is sufficient for most cases. MDTraj [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is
used in a range of cases or as the basis for other processing software
like TTClust [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which partitions thousands of frames into a limited
number of most dissimilar conformations. Other tools, like TRAVIS
(“Trajectory Analyzer and Visualizer”) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and pyPcazip [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], are
standalone and were developed for a specific case, thus lacking
generic applicability.
      </p>
      <p>
        The aforementioned packages, alongside the built-in tools
of MD and Monte Carlo (MC) simulation software, are well established and tested,
but they do not solve the problem of processing the big data
produced by MS. Machine Learning (ML) algorithms are
data analytics tools for cases where no equation or predefined model exists.
The goal is to deduce (“learn”) the model from the data. ML may
be useful not only for managing and analyzing the big data of MS
but also as a new way to study systems and discover
patterns that may lead to insights about the case under
investigation. ML has already been used in MS in many different ways, from
post-processing to the preparation of input parameters and the
reduction of the error of the simulation itself [
        <xref ref-type="bibr" rid="ref5 ref6 ref7 ref8 ref9">5–9</xref>
        ].
      </p>
      <p>In the current paper, we introduce a computational tool for
analyzing randomly oriented lipid bilayers derived from MD trajectories
and creating a dataset ready to be used in ML algorithms. The initial
data consist of spontaneously self-assembled lipid bilayer
structures obtained from MD simulations.</p>
    </sec>
    <sec id="sec-2">
      <title>2 CAPABILITIES AND IMPLEMENTATION</title>
      <p>The workflow under discussion consists of three distinct steps: (1)
the analysis of the MD trajectories, (2) the creation of the ML
ready dataset, and (3) the use of the dataset in ML algorithms (see
Figure 1).</p>
      <p>The tool is written in the Python 3 programming language and
provides a dynamic input interface that can be adapted to the
requirements of each use case. The user has to describe the atom
groups and the primary analysis for each group. Moreover, the
input interface allows the results of the primary analyses to be
combined in order to calculate secondary properties of the system.
These inputs must be written in Python dictionary
format.</p>
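      <p>To make the dictionary-based input more concrete, the following is a minimal sketch of what such a description might look like, together with a small consistency check. The group names, atom names, analysis labels, dictionary keys, and the function check_inputs are all hypothetical and chosen for illustration only; they are not the tool's actual schema.</p>
      <preformat><![CDATA[
```python
# Hypothetical input for illustration only: the group names, atom names,
# analysis labels and keys are assumptions, not the tool's real schema.
GROUPS = {
    "HEAD": {"atoms": ["N", "P"], "primary": ["density_peaks", "rdf_peaks"]},
    "TAIL": {"atoms": ["C1", "C2", "C3"], "primary": ["tilt", "order_parameter"]},
}

# Secondary properties combine primary results of one or more groups.
COMBINE = {
    "thickness": {"inputs": [("HEAD", "density_peaks")]},
    "tilt_vs_order": {"inputs": [("TAIL", "tilt"), ("TAIL", "order_parameter")]},
}


def check_inputs(groups, combine):
    """Return a list of problems found in the two input dictionaries."""
    problems = []
    for name, spec in groups.items():
        if not spec.get("atoms"):
            problems.append(f"group {name!r} lists no atoms")
        if not spec.get("primary"):
            problems.append(f"group {name!r} requests no primary analysis")
    for prop, spec in combine.items():
        for group, analysis in spec["inputs"]:
            if group not in groups:
                problems.append(f"{prop!r} refers to unknown group {group!r}")
            elif analysis not in groups[group].get("primary", []):
                problems.append(f"{prop!r} needs {analysis!r} for group {group!r}")
    return problems
```
]]></preformat>
      <p>In this sketch, a secondary property is accepted only if every primary result it depends on has been requested for an existing group, which mirrors the dependency between primary and secondary analyses described above.</p>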
      <p>In order to address the problem of lipid bilayers of different orientation and shape,
which result from self-assembly (see Section 3),
the tool performs a domain decomposition of the final
configuration and identifies the atoms that belong to the user-defined groups.
Each combination of group and domain becomes a sub-system that is analyzed
as a separate MD system. As such, each MD simulation may yield
more than one sub-system and, hence, more than one instance in the final dataset.
By breaking the system into small domains, in which the assumptions
of no curvature, no intersection points, etc. can be applied, each
conformation is treated as an ideal bilayer structure, and a series of
MD analysis tools can be used. The resulting dataset can be used
as input to ML algorithms, which make it possible to identify patterns
and gain insights into large and complex bilayer structures.</p>
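      <p>The decomposition step can be illustrated with a short sketch. The text does not specify how the domains are constructed, so the example below assumes a regular grid over a periodic orthorhombic box; the function names decompose and subsystems, and the representation of groups as sets of atom indices, are likewise assumptions made for illustration.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict


def decompose(positions, box, n_cells):
    """Map each grid cell (ix, iy, iz) to the indices of the atoms inside it.

    positions: list of (x, y, z) coordinates in a periodic orthorhombic box
    box:       (Lx, Ly, Lz) box edge lengths
    n_cells:   (nx, ny, nz) number of cells along each axis
    """
    domains = defaultdict(list)
    for index, xyz in enumerate(positions):
        cell = []
        for coord, length, n in zip(xyz, box, n_cells):
            wrapped = coord % length  # wrap the coordinate into the box
            # min() guards against rounding that would give index n
            cell.append(min(int(wrapped / length * n), n - 1))
        domains[tuple(cell)].append(index)
    return dict(domains)


def subsystems(domains, groups):
    """Split every domain by group; each (cell, group) pair is one sub-system.

    groups maps a group name to the set of atom indices that belong to it.
    """
    result = {}
    for cell, atoms in domains.items():
        for name, members in groups.items():
            selected = [a for a in atoms if a in members]
            if selected:
                result[(cell, name)] = selected
    return result
```
]]></preformat>
      <p>Each key of the dictionary returned by subsystems corresponds to one candidate instance of the final dataset. A finer grid yields flatter, more ideal sub-systems but fewer atoms per sub-system, so the cell count is a trade-off left to the user in this sketch.</p>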
      <p>
        The tool can efficiently load trajectory and/or topology data
in the formats used by the GROMACS [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] MD simulation package and
use many of the post-processing tools that GROMACS provides, alongside
customized calculations (primary or secondary), in order to calculate
a series of properties for each sub-system. The structural
characteristics calculated for each sub-system are the peaks of the
density profile, the tilt of the ordered part of the lipid
chains, the peaks of the radial distribution function of pairs of lipid
groups, and the order parameter of the lipid chains.
      </p>
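      <p>Two of the structural characteristics listed above, the chain tilt and the order parameter, can be sketched as follows. The example assumes that the tilt is the angle between a chain vector and the local bilayer normal, and that the order parameter takes the commonly used second-rank form, the average of (3 cos^2(theta) - 1) / 2 over the bond vectors; the paper does not state which definitions the tool uses, and the function names are illustrative.</p>
      <preformat><![CDATA[
```python
import math


def tilt_angle(vector, normal):
    """Angle in degrees between a chain vector and the local bilayer normal."""
    dot = sum(v * n for v, n in zip(vector, normal))
    norm = math.sqrt(sum(v * v for v in vector)) * math.sqrt(sum(n * n for n in normal))
    cos_theta = max(-1.0, min(1.0, dot / norm))  # guard against rounding
    return math.degrees(math.acos(cos_theta))


def order_parameter(vectors, normal):
    """Average of (3 cos^2(theta) - 1) / 2 over a set of bond vectors."""
    total = 0.0
    for vector in vectors:
        cos_theta = math.cos(math.radians(tilt_angle(vector, normal)))
        total += 0.5 * (3.0 * cos_theta ** 2 - 1.0)
    return total / len(vectors)
```
]]></preformat>
      <p>With this definition, vectors perfectly aligned with the normal give a value of 1, vectors perpendicular to it give -0.5, and an isotropic distribution averages to 0. In a randomly oriented sub-system the normal itself has to be estimated per domain, which is why the functions take it as an argument rather than assuming the z axis.</p>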
    </sec>
    <sec id="sec-3">
      <title>3 CASE STUDY</title>
      <p>The orientation of the lipid bilayer can be studied through MD
calculations. However, such treatments are based on the a priori
placement of the lipid molecules in appropriate positions, in order
to form a periodic system with appropriately oriented hydrophilic
head groups and hydrophobic chains. Although this
pre-built arrangement saves a substantial amount of simulation time,
it represents only a simplified and ideal approximation of the
equilibrium structure and does not ensure that its structural and
dynamical properties faithfully reproduce the real
lipid phase of the system under study.</p>
      <p>
        Other approaches [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] recreate the structure of the lipid bilayer
using MD with random initial configurations of the molecules. This
treatment aims to study the dynamics of the system as it moves
towards equilibrium and the spontaneous self-assembly of the
single lipids into a bilayer, as well as to simulate more realistic
conformations of minimum energy. Thus, any approximation based on
the a priori placement of the lipids is eliminated. Because the
randomness of the initial configuration affects the resulting
structure, a sufficient sample of self-assembled systems needs to be
produced (on the order of 10<sup>2</sup>). All of the resulting systems are
far from ideal in terms of shape and orientation, and their properties
are correlated with the local composition, shape, orientation, bilayer
thickness, etc. The available tools for analyzing MD trajectories lack
the functionality to process randomly oriented and shaped bilayer
structures. The tool presented in the current paper attempts to
address that problem by decomposing the final conformation of each simulation
into small sub-domains, calculating structural
properties of each sub-domain, such as the density profile, order parameter,
radial distribution function, and tilt of the lipid chains, and producing an
ML ready dataset to which data-driven techniques, such
as classification or clustering, can be applied. The ML techniques are expected to provide a
fast, efficient, and unbiased way to group the different sub-domains
and to help identify and extract the physical meaning of each
resulting group via its properties. That information will lead to
a recommendation of a series of distinct and well-defined bilayer
structures that exist simultaneously in the macroscopic system.
The recommended conformations can be reconstructed and
taken into account in future studies of the system.
      </p>
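      <p>As an illustration of the grouping step, the sketch below clusters sub-domains by their structural properties with a plain k-means procedure. The paper does not name a specific clustering algorithm, so k-means is used here only as a familiar stand-in; the function name kmeans, the deterministic farthest-point initialization, and the fixed iteration count are assumptions of this example.</p>
      <preformat><![CDATA[
```python
import math


def kmeans(rows, k, n_iter=50):
    """Plain k-means on a list of equal-length feature rows.

    Centroids start from the first row and then the row farthest from the
    centroids chosen so far, which keeps the sketch deterministic.
    """
    centroids = [list(rows[0])]
    while len(centroids) != k:
        farthest = max(rows, key=lambda r: min(math.dist(r, c) for c in centroids))
        centroids.append(list(farthest))
    labels = [0] * len(rows)
    for _ in range(n_iter):
        # Assignment step: each row joins its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(r, centroids[j])) for r in rows]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```
]]></preformat>
      <p>Each row would hold the properties of one sub-domain, for example tilt, order parameter, and peak positions. Because these quantities have different units and ranges, the columns should be standardized before clustering; otherwise the distance is dominated by the property with the largest numerical spread.</p>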
    </sec>
    <sec id="sec-4">
      <title>4 DISCUSSION</title>
      <p>The capabilities of the tool serve as a bridge, connecting MD data
with structural properties and ML algorithms for general data
science audiences. The derived measurements constitute a domain
dataset intended to feed ML algorithms in order to (i) explore patterns that
may emerge by applying unsupervised learning algorithms or (ii)
build a model that predicts a property of interest. Moreover, the
calculated properties can be used as supplementary data for a larger
dataset. The tool stands out for its novel approach of examining
the system as a series of sub-systems, thus overcoming the problems
and limitations of analyzing complex lipid bilayer structures.</p>
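      <p>A domain dataset of this kind is essentially a table with one row per sub-system. The following sketch shows one possible way to flatten per-sub-system results into such a table and serialize it as CSV so that it can be merged with a larger dataset. The record layout, the column names, and the functions to_rows and to_csv are assumptions made for illustration, not the tool's output format.</p>
      <preformat><![CDATA[
```python
import csv
import io


def to_rows(records):
    """Flatten {(simulation, domain): {property: value}} into table rows."""
    columns = sorted({name for props in records.values() for name in props})
    header = ["simulation", "domain"] + columns
    rows = []
    for (simulation, domain), props in sorted(records.items()):
        # A property missing for a sub-system is left empty, not guessed.
        rows.append([simulation, domain] + [props.get(c, "") for c in columns])
    return header, rows


def to_csv(records):
    """Serialize the table as CSV text, one line per sub-system."""
    header, rows = to_rows(records)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()
```
]]></preformat>
      <p>Keeping the simulation and domain identifiers as explicit columns lets each instance be traced back to the conformation it came from, which matters when a cluster or prediction has to be interpreted physically.</p>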
      <p>The tool is currently an alpha version, in which tests and
debugging are performed. As future work, the results
of the case study are planned to be used, alongside the first public
release of the code under a free and open source license. The tool is
hosted at: https://mssg.ipta.demokritos.gr/gitlab/skarozis/toobba</p>
    </sec>
    <sec id="sec-5">
      <title>ACKNOWLEDGMENTS</title>
      <p>This research is co-financed by Greece and the European Union
(European Social Fund - ESF) through the Operational Programme
«Human Resources Development, Education and Lifelong
Learning» in the context of the project "Reinforcement of Postdoctoral
Researchers - 2nd Cycle" (MIS-5033021), implemented by the State
Scholarships Foundation (IKY).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R. T.</given-names>
            <surname>McGibbon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. A.</given-names>
            <surname>Beauchamp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Harrigan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Klein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Swails</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. X.</given-names>
            <surname>Hernández</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. R.</given-names>
            <surname>Schwantes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. J.</given-names>
            <surname>Lane</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V. S.</given-names>
            <surname>Pande</surname>
          </string-name>
          , “
          <article-title>MDTraj: A Modern Open Library for the Analysis of Molecular Dynamics Trajectories</article-title>
          ,”
          <source>Biophysical Journal</source>
          , vol.
          <volume>109</volume>
          , pp.
          <fpage>1528</fpage>
          -
          <lpage>1532</lpage>
          , oct
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Tubiana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-C.</given-names>
            <surname>Carvaillo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Boulard</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Bressanelli</surname>
          </string-name>
          , “
          <article-title>TTClust: A Versatile Molecular Simulation Trajectory Clustering Program with Graphical Summaries</article-title>
          ,”
          <source>Journal of Chemical Information and Modeling</source>
          , vol.
          <volume>58</volume>
          , pp.
          <fpage>2178</fpage>
          -
          <lpage>2182</lpage>
          , nov
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Brehm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Thomas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gehrke</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Kirchner</surname>
          </string-name>
          , “
          <article-title>TRAVIS-A free analyzer for trajectories from molecular simulation</article-title>
          ,”
          <source>The Journal of Chemical Physics</source>
          , vol.
          <volume>152</volume>
          , p.
          <fpage>164105</fpage>
          , apr
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Shkurti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Goni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Andrio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Breitmoser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Bethune</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Orozco</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Laughton</surname>
          </string-name>
          , “
          <article-title>pyPcazip: A PCA-based toolkit for compression and analysis of molecular simulation data</article-title>
          ,”
          <source>SoftwareX</source>
          , vol.
          <volume>5</volume>
          , pp.
          <fpage>44</fpage>
          -
          <lpage>50</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E.</given-names>
            <surname>Swann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Cleland</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Barnard</surname>
          </string-name>
          , “
          <article-title>Representing molecular and materials data for unsupervised machine learning</article-title>
          ,”
          <source>Molecular Simulation</source>
          , vol.
          <volume>44</volume>
          , pp.
          <fpage>905</fpage>
          -
          <lpage>920</lpage>
          , jul
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B. K.</given-names>
            <surname>Carpenter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. S.</given-names>
            <surname>Ezra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. C.</given-names>
            <surname>Farantos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z. C.</given-names>
            <surname>Kramer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Wiggins</surname>
          </string-name>
          , “
          <article-title>Empirical Classification of Trajectory Data: An Opportunity for the Use of Machine Learning in Molecular Dynamics</article-title>
          ,”
          <source>The Journal of Physical Chemistry B</source>
          , vol.
          <volume>122</volume>
          , pp.
          <fpage>3230</fpage>
          -
          <lpage>3241</lpage>
          , apr
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Haghighatlari</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Hachmann</surname>
          </string-name>
          , “
          <article-title>Advances of machine learning in molecular modeling and simulation</article-title>
          ,”
          <source>Current Opinion in Chemical Engineering</source>
          , vol.
          <volume>23</volume>
          , pp.
          <fpage>51</fpage>
          -
          <lpage>57</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>Sidky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Ferguson</surname>
          </string-name>
          , “
          <article-title>Machine learning for collective variable discovery and enhanced sampling in biomolecular simulation</article-title>
          ,”
          <source>Molecular Physics</source>
          , vol.
          <volume>118</volume>
          , p.
          <fpage>e1737742</fpage>
          , mar
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>F.</given-names>
            <surname>Noé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tkatchenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-R.</given-names>
            <surname>Müller</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Clementi</surname>
          </string-name>
          , “
          <article-title>Machine Learning for Molecular Simulation</article-title>
          ,”
          <source>Annual Review of Physical Chemistry</source>
          , vol.
          <volume>71</volume>
          , pp.
          <fpage>361</fpage>
          -
          <lpage>390</lpage>
          , apr
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Abraham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Murtola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Schulz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Páll</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hess</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Lindahl</surname>
          </string-name>
          , “
          <article-title>GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers</article-title>
          ,”
          <source>SoftwareX</source>
          , vol.
          <volume>1-2</volume>
          , pp.
          <fpage>19</fpage>
          -
          <lpage>25</lpage>
          , sep
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S. N.</given-names>
            <surname>Karozis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. I.</given-names>
            <surname>Mavroudakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. C.</given-names>
            <surname>Charalambopoulou</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Kainourgiakis</surname>
          </string-name>
          , “
          <article-title>Molecular simulations of self-assembled ceramide bilayers: comparison of structural and barrier properties</article-title>
          ,”
          <source>Molecular Simulation</source>
          , vol.
          <volume>46</volume>
          , pp.
          <fpage>323</fpage>
          -
          <lpage>331</lpage>
          , mar
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>