=Paper= {{Paper |id=Vol-1686/WSSSPE4_paper_35 |storemode=property |title=Development of a software framework for formalizing forcefield atom-typing for molecular simulation |pdfUrl=https://ceur-ws.org/Vol-1686/WSSSPE4_paper_35.pdf |volume=Vol-1686 |authors=Janos Sallai,Christopher Iacovella,Christoph Klein,Tengyu Ma }} ==Development of a software framework for formalizing forcefield atom-typing for molecular simulation== https://ceur-ws.org/Vol-1686/WSSSPE4_paper_35.pdf
Idea Paper: Development of a Software Framework for Formalizing Forcefield
                  Atom-Typing for Molecular Simulation

                      Christopher R. Iacovella, Janos Sallai, Christoph Klein, Tengyu Ma
                                       Vanderbilt University, Nashville, TN
                 {janos.sallai,christopher.r.iacovella, christoph.klein,tengyu.ma}@vanderbilt.edu


   Abstract—Forcefields are a crucial ingredient of Molec-       researchers to instead focus their efforts on the motivat-
ular Dynamics (MD) simulations, describing the types and         ing scientific questions. For example, the large number
parameters of interactions between the simulated particles.      of parameters available for the OPLS forcefield has
These parameter sets, however, are typically specific to the
molecule in which the atom appears, where within the             allowed for the automated screening of drug molecules in
molecule the atom is positioned, the phase or state point        order to identify promising candidates for more effective
of the system, as well as the simulator tool in use. This        treatments of HIV [11].
makes choosing the correct parameter values a tedious
and error prone task. Forcefield parameters, furthermore,
are often hard to locate: some are published in scientific
papers, others come with MD tools, often with no or
ambiguous documentation on their applicability. In this
paper, we present a framework that aims to solve this
data management issue, proposing a common format for
forcefields that is self-documenting with machine readable,
declarative usage rules. We believe that processes and tools
that are commonly used today in software development (e.g.,
unit testing, verification and validation, continuous integra-   Fig. 1. Perfluorobutane, CF3 -CF2 -CF2 -CF3 , molecule shown
tion, and version control) are, with proper infrastructure       with ball and stick representation.
support, applicable to forcefield development, as well. The
paper describes how such an infrastructure can tackle man-          However, while researchers do not necessarily need
aging and evolving forcefields by the MD community, and          to spend time developing the forcefield parameters, de-
proposes a way to encourage and incentivize involvement          termining which parameters to use (i.e., atom-typing)
by the stakeholders.                                             is still often a tedious and error prone task. In many
                                                                 forcefields, the appropriate interaction parameters will
I. Introduction                                                  depend on the local topological environment of atoms.
Molecular simulation plays a key role in understanding           For example, the non-bonded interactions of carbon
the atomistic and molecular level interactions that un-          atoms in perfluoroalkanes (PFA) are typically differ-
derlie many natural and man-made materials and pro-              ent for terminal carbons versus “middle” carbons [12].
cesses. [1]–[4] Classical molecular simulations rely upon        Similarly, when identifying the correct parameters for
forcefields to describe the various interactions that exist      a torsional term, one must typically consider not just
between atoms and/or groups of atoms, including non-             the backbone, (e.g., C-C-C-C), but also the bonds of
bonded interactions (van der Waals and electrostatics)           each atom in the backbone (e.g., CH3 -CH2 -CH2 -CH3 ,
and bonded interactions (bonds, angles, and torsions).           vs. CF3 -CF2 -CF2 -CF3 ) [7], [13]. Consequently, a given
These forcefields are typically expressed as a set of            forcefield may include multiple unique parameters for
analytical functions with adjustable fitting parameters          a given atom, whose usage depends on the chemical
for different atomic/molecular species. Considerable ef-         context of the atom. That is, the appropriate parameters
forts have been undertaken by many research groups               will typically depend on a number of factors, includ-
to develop accurate forcefields, both in determining the         ing the specific molecule in which the atom appears
mathematical functions and associated fitting parameters,        (e.g., alkanes vs. perfluoroalkanes), or where in that
for a large variety of molecular species under differ-           molecule the atom appears (e.g., terminal vs. middle). As
ent conditions. Numerous forcefields (and associated             a clear illustration of this, we note that the OPLS all-
parameters), have been devised with acronyms such                atom forcefield parameter database (as provided in the
as: AMBER [5], CHARMM [6], OPLS [7], SKS [1],                    TINKER molecular modeling software [14]) contains
TraPPE [8], COMPASS [9] and GROMOS [10]. The                     427 different “types” of carbon atoms, which are all
availability of these forcefields can significantly reduce       differentiated based on their chemical context. Further-
or completely eliminate the difficult and costly task of         more, determining which of the multitude of forcefield
determining the interactions between species, allowing           parameters is most appropriate, based on the chemical
This work is licensed under a CC-BY-4.0 license.                 context of an atom, is often accomplished through brief,
unstructured – and sometimes ambiguous – annotations            not inadvertently override other rules. This may impose
located in the forcefield parameter files. Even journal         practical limits on functionality, where, for example, a
articles associated with forcefield parameters can be           user is not able to easily extend the rules to include newer
unclear, where parameters are typically listed in table         parameters, or that such attempts to extend rules result
form, often with limited annotations or examples of             in incorrect atom-typing for other molecular species.
their usage. Also, given their static nature, parameters           To address these issues, we are developing a new
provided in journal articles may not be the most up-to-         framework for atom-typing, based upon first order logic
date, whether resulting from typographic errors or modi-        over graph structures. The novelty of our approach lies
fications in later work. As a result, for many forcefields,     in the declarative annotation syntax that allows for 1)
the formal logic required to distinguish atom types may         decoupling the definition of forcefield-specific atom-
be difficult to find, difficult to interpret, out-of-date, or   typing rules from how a forcefield-agnostic tool uses
simply ambiguous. This ambiguity is compounded by the           fact that many published journal articles that rely on
fact that many published journal articles that rely on          typing of complex molecular systems, and 2) formally
forcefields often provide only vague citations for the          verifying and automatically validating annotated force-
source of the parameters.                                       fields. Specifically, we propose to:
   Most research groups use some combination of hand              • establish a forcefield agnostic (i.e., general) for-
parameterization and logic based codes to facilitate the            malism to express the chemical context in which
process of atom-typing. Identifying atom types by hand,             a particular force field entry is applicable (i.e.,
while easy for small/simple molecules and important for             forcefield usage semantics);
validation, becomes impractical for large molecules or            • develop a tool suite that automates atom-typing of
systems with significant heterogeneity. For molecules               molecular structures using the semantics-annotated
with only a few atomic species and little variability, ad           forcefields;
hoc, logic-based codes are relatively straightforward to          • and establish a development process to create, incre-
write and test. However, as the number of unique atom               mentally extend, and evolve annotated forcefields,
types increases, properly defining the appropriate if/else          providing tools for automated verification, valida-
statements required in most logic codes becomes oner-               tion, and continuous integration techniques.
ous, even for people who are well versed in programming            An important goal of the project is to disseminate
and have domain expertise in molecular modeling. Also,          results and to foster community involvement through
nested logic statements are often difficult to adequately       the creation of a centralized online forcefield repository
test and debug and thus introduce a potential source            containing the formally annotated forcefields, documen-
of error. Since ad hoc codes are not typically released         tation of how to annotate forcefields, along with files
to the general community, any errors in the code may            for benchmarking and validation of the atom-typing
go undiscovered and thus data based on flawed atom-             software. Given the importance of accurately applying
typing may appear in published scientific literature. To        forcefields to molecular simulation, the new atom-typing
address this, several community tools and approaches            framework developed in this work has the potential
have been developed to aid in atom-typing [15]–[20],            to significantly impact the community, affording re-
some of which are more generally applicable than others.        searchers greater confidence in the model parameters
Forcefields developed in the biophysics community tend          used in their studies, eliminating the need to develop ad
to have exceptionally well vetted parameterization codes,       hoc atom-typing codes, making forcefield parameter usage
such as AMBER’s antechamber [21], but such tools are            clearer, and significantly reducing incorrect atom-typing as
typically designed to only work with their associated           a source of error and inconsistency in published results.
forcefield and may also produce output specific to a
given simulation package, rather than a general form. In        II. Vision
general, these tools all rely on a hierarchy where rules        In our vision, forcefield definitions are not just sets of
that identify more specialized atom types must be called        tables with numerical data, but unambiguous, well-
in precise order [21], such that more general atom types        structured documents with rich metadata, including the
are only chosen when more specialized matches do not            formal description of the chemical context in which the
exist (i.e., they include rule precedence). Maintaining, let    particular parameter values can be applied. In the future,
alone constructing, these hierarchies is extremely error        the tedious and error prone task of manually atom-typing
prone and, just as in ad hoc codes, typically results           and parameterizing complex molecular models (inputs
in source code with deeply nested if/else statements.           to simulators) will be eliminated, and replaced by the
In these hierarchical schemes, in order to add a new            automated process that relies on the machine-readable
atom type or correct an error, a developer must have            forcefield usage semantics.
a complete picture of the hierarchy and know exactly               We envision forcefields as dynamic data entities
where the relevant rule should be placed, such that it does     that evolve over time: parameterizations of additional
chemical species are added, already supported chem-             evaluates to true if and only if the chemical species
ical species are specialized, and parameter values are          of the atom is carbon (here, C is a built-in type);
tweaked for better modeling of the chemical interactions.       bonded atoms() evaluates to the set of all atoms bonded
Forcefield definitions will be maintained by online com-        to the one of interest, which can be further filtered
munities and hosted at shared repositories with version         by predicates. For instance, bonded atoms(type!=C)
control capabilities. These online repositories will also       would evaluate to a set of all non-carbon bonded atoms.
support continuous integration (CI), i.e., automated ver-       The proposed DSL will also support functions on sets,
ification, validation, and testing of the forcefields as they   such as count(), and common set operations such as
evolve.                                                         union, intersection, difference, etc. Also, the formalism
   We believe that the developers of forcefield definitions     will support the existential and universal quantifiers
deserve credit for their efforts. The online repositories       (exists() and for all()) to allow evaluating first-order
will generate permanent URLs for each forcefield ver-           logic statements over sets. Furthermore, the language
sion, as well as digital object identifiers (DOIs), which       will include support, through built-in functions, to ex-
allow unambiguously referencing these data artifacts,           press common molecular structures that are too verbose
and properly citing them in scientific publications.            to express otherwise, such as, for instance, rings of a
                                                                particular size.
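As a rough illustration of the predicates and set functions described for the annotation language (type tests, bonded atoms(), count(), exists(), and for all()), the following Python sketch evaluates the annotations of Eq. 1 and Eq. 2 on perfluorobutane. The graph representation, the function signatures, and all helper names (perfluorobutane, bonded_atoms, assign_types, RULES) are assumptions made for this example only; they are not the syntax or the implementation proposed in the paper.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

Elements = Dict[int, str]      # atom index -> element symbol
Bonds = Dict[int, Set[int]]    # atom index -> indices of bonded atoms


def perfluorobutane() -> Tuple[Elements, Bonds]:
    """Build CF3-CF2-CF2-CF3: carbons 0-3 in a chain, fluorines 4-13."""
    elements: Elements = {}
    bonds: Bonds = {}

    def add_atom(index: int, element: str) -> None:
        elements[index] = element
        bonds[index] = set()

    def add_bond(a: int, b: int) -> None:
        bonds[a].add(b)
        bonds[b].add(a)

    for carbon in range(4):
        add_atom(carbon, "C")
    for carbon in range(3):
        add_bond(carbon, carbon + 1)
    next_index = 4
    for carbon, n_fluorines in enumerate([3, 2, 2, 3]):
        for _ in range(n_fluorines):
            add_atom(next_index, "F")
            add_bond(carbon, next_index)
            next_index += 1
    return elements, bonds


def bonded_atoms(atom: int, elements: Elements, bonds: Bonds,
                 predicate: Optional[Callable[[str], bool]] = None) -> List[int]:
    """Neighbors of `atom`, optionally filtered by a predicate on the element."""
    return [n for n in sorted(bonds[atom])
            if predicate is None or predicate(elements[n])]


def exists(atoms: List[int], elements: Elements,
           predicate: Callable[[str], bool]) -> bool:
    """Existential quantifier over a set of atoms."""
    return any(predicate(elements[a]) for a in atoms)


def for_all(atoms: List[int], elements: Elements,
            predicate: Callable[[str], bool]) -> bool:
    """Universal quantifier over a set of atoms."""
    return all(predicate(elements[a]) for a in atoms)


def is_c791(atom: int, elements: Elements, bonds: Bonds) -> bool:
    """Eq. 1: a carbon with three bonded fluorines and one bonded carbon."""
    return (elements[atom] == "C"
            and len(bonded_atoms(atom, elements, bonds, lambda e: e == "F")) == 3
            and len(bonded_atoms(atom, elements, bonds, lambda e: e == "C")) == 1)


def is_c792(atom: int, elements: Elements, bonds: Bonds) -> bool:
    """Eq. 2: a carbon with two bonded carbons and two bonded fluorines."""
    return (elements[atom] == "C"
            and len(bonded_atoms(atom, elements, bonds, lambda e: e == "C")) == 2
            and len(bonded_atoms(atom, elements, bonds, lambda e: e == "F")) == 2)


RULES = {"C791": is_c791, "C792": is_c792}


def assign_types(elements: Elements, bonds: Bonds) -> Dict[int, str]:
    """Assign a type to every atom matched by exactly one rule."""
    types: Dict[int, str] = {}
    for atom in elements:
        matches = [name for name, rule in RULES.items()
                   if rule(atom, elements, bonds)]
        if len(matches) == 1:
            types[atom] = matches[0]
    return types


elements, bonds = perfluorobutane()
atom_types = assign_types(elements, bonds)
non_carbon_neighbors = bonded_atoms(0, elements, bonds, lambda e: e != "C")
only_c_or_f = for_all(bonded_atoms(1, elements, bonds), elements,
                      lambda e: e in ("C", "F"))
```

Here the two terminal carbons match only the Eq. 1 rule and the two middle carbons match only the Eq. 2 rule. The sketch does not handle rule precedence or user-defined types.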
III. Approach                                                      For convenience, we will allow the annotations to
The work to be carried out can be roughly broken down
                                                                reference user-defined types in the language (not just
into two main efforts: (1) the development of the atom-
                                                                the built-in chemical species, such as C and F). The
typing framework, including the formalism to express
                                                                statement
forcefield usage semantics, and (2) examination of case
                                                                  C791 : type = C &
studies designed to test, validate, and refine the atom-
typing framework. We note that these efforts will be                     count(bonded atoms(type = F )) = 3 & (3)
executed concurrently, to provide a continual loop of                    count(bonded atoms(type = C792 )) = 1
development, testing, and refinement.                           would specify that the carbon atom at the end of the
A. Annotation Syntax                                            fluorocarbon chain must have a C792 type neighbor,
One of the main goals of the proposed effort is to              which is a carbon in the fluorocarbon backbone. This is
design a domain specific language (DSL) for annotating          clearly a more restrictive atom type usage specification
forcefield parameters. The DSL will be used to express          than Eq. 1, allowing it to more specifically express the
the chemical context in which the particular atom type is       chemical context.
applicable. This DSL will, effectively, serve as a means           It is crucial that the annotation syntax we propose be
to unambiguously document the forcefield. The syntax            future-proof and support the evolution of forcefields.
we propose will be expressive, unambiguous, and both            Evolution can mean two things: a.) the forcefield gets
human and machine readable.                                     extended with support for new chemical species, or b.)
   Consider the following example of tagging the carbon         already supported species get more specialized. Annotat-
atoms in the perfluorobutane (shown in Fig. 1) with             ing the atom types for the newly added chemical species
its usage semantics to create the DSL. Atom type 791            can trivially be done incrementally. However, when an
in TINKER’s OPLS-aa parameter database [14] corre-              existing atom type is specialized, other existing atom-
sponds to a terminal carbon of a perfluorobutane chain.         typing rules referencing the specialized one can also be
This will be annotated with the following statement:            affected. For instance, let us assume that we want to
  C791 : type = C &                                             distinguish between C791 -type fluorocarbon end groups
                                                                based on what kind of carbon they are bonded to. Let us
         count(bonded atoms(type = F )) = 3 & (1)
                                                                assume that to do this, we would remove the C791 atom
         count(bonded atoms(type = C)) = 1.                     type from the forcefield and replace it with C791A for
   This means that atom type C791 is a carbon (C) atom,         a chemical context when the carbon neighbor is part of
which can be used in a chemical context where it has            a fluorocarbon backbone (C792 ), and with C791B if it is
3 bonded fluorines (F ), and one carbon. Similarly, atom        not. Since the generic C791 type has been removed from
type 792, applicable to carbon atoms that are part of a         the forcefield, all existing annotations that reference it
fluorocarbon backbone, would be annotated as                    need to be changed to reference C791A and/or C791B .
   C792 : type = C &                                               We want to avoid this, because it would make ex-
                                                                tending annotated forcefields a laborious and error-prone
         count(bonded atoms(type = C)) = 2 & (2)
                                                                task, which would hinder the wide-spread acceptance
         count(bonded atoms(type = F )) = 2.                    of our proposed formalism, and would jeopardize the
  Notice that the above annotations are logic statements        success of our efforts. To tackle this issue, the anno-
consisting of predicates over the topology: type = C            tation language will allow multiple atom types to be
assigned to a given atom in a molecule, but the atom        handle molecules with ring structures, a global invariant
type annotations will be required to explicitly express     can state that:
the specialization relations (i.e., that atom type C791A
                                                                        for all(type = C, !in ring()).            (6)
overrides C791 ). This way, if multiple atom types are
applicable to a particular chemical context, it is the most We note that as part of defining the DSL for annotating
specialized one that will get assigned to the given atom.   the forcefield, we will develop various “helper” tools to
The following example demonstrates a possible syntax        ensure proper syntax usage.
to achieve that, by including the definition of a more
                                                            B. The Atom-Typing Tool
generic perfluoroalkane carbon, CPFA .
                                                            Our approach differs from existing atom-typing tools in
CPFA : type = C &                                           a number of ways. First, existing tools are forcefield-
          count(bonded atoms()) = 4 &                       specific, while the proposed atom-typer is forcefield-
          for all(bonded atoms(), type = C || type = F )    agnostic. Second, the common practice is to hard-code
                                                            the atom-typing logic into the tool’s source code, which
  C791 : type = CPFA &                                      makes the code hard to extend as the forcefield evolves.
          count(bonded atoms(type = F )) = 3 &              The formalism we propose factors out the atom-typing
          count(bonded atoms(type = C)) = 1                 logic into declarative annotations, which, instead of
          @overrides(CPFA )                                 describing how atom typing rules are executed, state
                                                            the invariants of the chemical context that must hold
C791A : type = C791 &                                       true for a given atom type or parameters. The proposed
          count(bonded atoms(type = C792 )) = 1             atom-typer will interpret this declarative formalism, and
          @overrides(C791 )                                 compute the atom type assignments that satisfy the
C791B : type = C791 &                                       invariants. Third, existing tools often do not cover all
                                                            chemical contexts supported by the forcefield, but do
          !exists(bonded atoms(), type = C792 )             produce some (potentially bogus) output when run on
          @overrides(C791 )                                 topologies that include such contexts. Our tool relies on
                                                        (4) a three-pronged approach to alleviate such problems: 1)
   Support for overriding annotations is important for formal verification of forcefield annotations that reveal
two reasons. First, it allows for incremental development omissions and contradictions, 2) global invariants in the
of forcefields, and second, the overridden, more general forcefield that map to assertions on the input topologies,
atom types can be used as wildcards in references. Con- warning the user of unsupported molecular features, and
sider the following annotation of a torsional term defined 3) a proposed test suite and continuous integration setup
over four carbon atoms on a fluorocarbon backbone that validates the annotated forcefields against known,
(bonds, and angles are annotated similarly):                correctly atom-typed topologies.
                                                               An important goal of the proposed effort is to lever-
C13 C13 C13 C13PFA : (CPFA , C792 , C792 , CPFA )
                                                            age the annotated forcefields to automatically atom-
                                                        (5)
                                                            type complex molecular systems. The description of the
This annotation defines that the torsional term molecular system must include at least the chemical
C13 C13 C13 C13PFA is applicable to a series of species of the atoms (element names or atomic numbers),
four carbon atoms such that the first and the last one and their connectivity (bonds), both of which could be
can be any kind of carbon in a fluorocarbon molecule, provided, for instance, from our previously developed
but the two middle atoms must be of a more specific mBuild tool [22], [23]. The atom-typer tool we propose
type, C792 , with exactly two carbon neighbors. This to develop will read this topology, along with an anno-
allows us to uniquely identify bonded interactions for tated forcefield specification, and will produce an output
atom-types that are of the same “atom class.” The DSL topology with the forcefield specific atom types, as well
can also be extended to provide other parameters that as bonded interaction types (bonds, angles, torsions, etc.)
may be required, for example, a molecular length option and associated parameters, which can be used as input
to uniquely identify instances when a specific torsion to a molecular simulator.
parameter should be used.                                      For the atom-typer tool, we will investigate the fol-
   Apart from annotating atom types and various bonded lowing implementation approaches:
interaction types (bonds, angles, torsions) in the force-      Naive approach. A naive algorithm first orders the
field, the proposed language supports global invariants, atom type annotations according to the “referenced-by”
as well. These invariants can be used to express con- relationship, that is, if annotation C791 references anno-
straints on the applicability of the forcefield as a whole, tation C792 , then C792 will precede C791 in the order.
and are essential to prevent force field use on unsup- Then, following this ordering, the algorithm evaluates
ported topologies. For example, if the force field cannot all atom-type annotations for all atoms in the input
topology. This ensures that all referenced atom-typing         mapped to Datalog, what are the performance impli-
rules are evaluated before those that reference them. If       cations, including how Datalog query execution scales with
multiple annotations evaluate to true for a given atom         the topology size and the number of atom types in the
in the input topology (e.g., CPFA , C791 , and C791A ), the    forcefield.
most specialized type (C791A ) will be chosen, following          Subgraph isomorphism. We expect that many (but
the @overrides relations of the annotations. Obviously,        not all) of the forcefield annotations will, implicitly,
this naive approach would fail if no partial ordering          describe the chemical context as a subgraph. While the
of atom type annotations exists, i.e., if there are            subgraph isomorphism problem, in the general case, is
annotations that mutually reference each other, or if          NP-complete [27], this is not the case with the problem
circular dependencies exist.                                   of finding where a particular chemical neighborhood is
   Fixpoint-based method. Allowing recursion in anno-          present in the overall topology [28]. This is due to the
tations would make the language more expressive. But in        fact that chemical topologies belong to a special class of
order to accommodate recursive annotations, we need to         graphs (planar graphs) where the maximum number of
tweak the naive approach. First, the fixpoint-based solver     bonded atoms is well defined (i.e., never more than 4 for
evaluates the primitive rules, i.e., the ones that do not      covalently bonded systems), and that the chemical neigh-
reference others, for every atom in the topology. If an        borhoods in question tend to be relatively small. State of
annotation evaluates to true (on any atom), it enables         the art graph matching solutions that employ a plethora
those annotations that reference it. When the enabled          of optimization techniques and exploit parallelism today
annotation is then evaluated, it, in turn, may enable or re-   can scale up to graphs with 10^9 nodes [29].
enable others. This continues until no further annotations        We suspect that for certain kinds of annotations,
are enabled (that is, a fixpoint is reached), which is the     especially those involving ring structures, a subgraph
termination condition of the iteration.                        matching based approach would provide better perfor-
   Logic programming. Logic programming lan-                   mance than directly mapping such atom type annotations
guages [24], such as Prolog [25] or Datalog [26], that         to, for instance, Datalog rules. Therefore, we envision
have been designed for deductive reasoning, are partic-        that the final version of the atom-typing tool will borrow
ularly useful in this context. Logic programs consist of       from several of the above-mentioned approaches.
facts, i.e., things that always hold true (e.g., “Mickey          It is important to note that the declarative nature of
is a mouse”, “Pluto is a dog”, “Mars is a planet”),            the proposed annotation language allows us to decouple
and deduction rules that can be used to define relations       the forcefield specification (what statements must hold
(e.g., “a mouse is an animal”, “a dog is an animal”).          true for a correctly atom-typed topology) from execution
The program is then run by the user posting queries            (how it is achieved). That is, the particular execution
(e.g., “list all animals”), which the interpreter runtime      approach we will eventually choose is orthogonal to
executes and evaluates (returning Mickey and Pluto in          the forcefield annotation syntax, so the implementation
this example). By representing the “bonded-to” relations       of the atom-typer tool can evolve independently of the
of the input topology as tuples, and evaluating certain        forcefield annotations.
non-logic functions in advance (e.g., enumerating ring
                                                               C. Automated Verification, Validation, and Testing
structures), we can encode them as facts in a logic pro-
gram (e.g., “atom1 is a carbon”, “atom2 is a hydrogen”,        Annotating a forcefield with atom-typing semantics is
“atom1 is part of ring1”, “atom1 is bonded to atom2”,          a major undertaking, and it is inevitable that we make
etc.). Similarly, the annotations of the forcefield’s atom     errors on the way. The same is true for a complex piece
types will be mapped to deduction rules, (e.g., “a carbon      of software, such as the atom-typer tool. We believe that
atom with 4 bonded hydrogens is a methane carbon”).            through proper testing, verification, and validation, we
Then, the logic program can be run by executing queries        can build quality forcefield annotations and tools that
to list atoms with each atom type (e.g., “list all methane     either produce the correct atom-typing results, or fail
carbons”).                                                     with adequate warnings or error messages. Our approach
                                                               is unique in that it provides two very distinct types of
   An important difference between Prolog and Datalog
                                                               testing: 1) verification of the underlying rules to eval-
is that while Prolog is an expressive, general-purpose
                                                               uate inconsistencies and 2) validation of the outputted
logic programming language, Datalog is not Turing-
                                                               molecular models.
complete, and mostly focuses on reasoning about data.
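The mapping described above for logic programming, with the topology encoded as facts and the annotations as deduction rules, can be sketched in plain Python as a small bottom-up evaluation that repeats until no new facts are derived (a fixpoint). This is not Datalog or Prolog; the fact tuples, the rule functions, and the names topology_facts, solve, most_specialized, RULES, and OVERRIDES are hypothetical and chosen only for this illustration. The rules follow Eq. 1, Eq. 2, and the C791A annotation of Eq. 4, with C791A overriding C791.

```python
from typing import Callable, Dict, List, Set, Tuple

# A fact is a 3-tuple, e.g. ("element", 0, "C"), ("bonded", 0, 1), ("type", 0, "C791").
Fact = Tuple[str, int, object]


def topology_facts() -> Set[Fact]:
    """Facts for perfluorobutane: carbons 0-3 in a chain, fluorines 4-13."""
    facts: Set[Fact] = set()

    def bond(a: int, b: int) -> None:
        facts.add(("bonded", a, b))
        facts.add(("bonded", b, a))

    for carbon in range(4):
        facts.add(("element", carbon, "C"))
    for carbon in range(3):
        bond(carbon, carbon + 1)
    next_index = 4
    for carbon, n_fluorines in enumerate([3, 2, 2, 3]):
        for _ in range(n_fluorines):
            facts.add(("element", next_index, "F"))
            bond(carbon, next_index)
            next_index += 1
    return facts


def neighbors(facts: Set[Fact], atom: int) -> List[int]:
    return [b for (kind, a, b) in facts if kind == "bonded" and a == atom]


def count(facts: Set[Fact], atom: int, kind: str, value: str) -> int:
    """Number of neighbors of `atom` for which the fact (kind, neighbor, value) holds."""
    return sum((kind, n, value) in facts for n in neighbors(facts, atom))


def c791(facts: Set[Fact], atom: int) -> bool:
    return (("element", atom, "C") in facts
            and count(facts, atom, "element", "F") == 3
            and count(facts, atom, "element", "C") == 1)


def c792(facts: Set[Fact], atom: int) -> bool:
    return (("element", atom, "C") in facts
            and count(facts, atom, "element", "C") == 2
            and count(facts, atom, "element", "F") == 2)


def c791a(facts: Set[Fact], atom: int) -> bool:
    # References two other atom types, so it depends on previously derived facts.
    return (("type", atom, "C791") in facts
            and count(facts, atom, "type", "C792") == 1)


# C791A is deliberately listed first: the result must not depend on rule order.
RULES: Dict[str, Callable[[Set[Fact], int], bool]] = {
    "C791A": c791a, "C791": c791, "C792": c792,
}
OVERRIDES = {"C791A": "C791"}  # specialized type -> the type it overrides


def solve(initial: Set[Fact]) -> Set[Fact]:
    """Apply all rules to all atoms until no new type fact is derived."""
    facts = set(initial)
    atoms = sorted({a for (kind, a, _) in facts if kind == "element"})
    changed = True
    while changed:
        changed = False
        for name, rule in RULES.items():
            for atom in atoms:
                fact = ("type", atom, name)
                if fact not in facts and rule(facts, atom):
                    facts.add(fact)
                    changed = True
    return facts


def most_specialized(facts: Set[Fact]) -> Dict[int, str]:
    """Keep, per atom, the single type that no other matching type overrides."""
    by_atom: Dict[int, Set[str]] = {}
    for kind, atom, name in facts:
        if kind == "type":
            by_atom.setdefault(atom, set()).add(str(name))
    result: Dict[int, str] = {}
    for atom, names in by_atom.items():
        remaining = names - {OVERRIDES[n] for n in names if n in OVERRIDES}
        if len(remaining) != 1:
            raise ValueError(f"ambiguous atom types for atom {atom}: {sorted(remaining)}")
        result[atom] = remaining.pop()
    return result


final_types = most_specialized(solve(topology_facts()))
```

Because the rules only add facts and use no negation, the loop terminates once every derivable type fact is present. Handling negated conditions such as the !exists() in C791B would need additional care (for example, evaluating rules in dependency order), which this sketch does not attempt.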
Also, Datalog imposes restrictions on the use of               1) Verification
negation and recursion (Prolog does not); however, Dat-        With verification, we want to answer the question: “Are
alog queries are always guaranteed to terminate (while         we annotating the forcefield correctly?” The fact that our
in Prolog, there is no such guarantee).                        proposed annotation syntax can be mapped to first-order
   In the proposed effort, we will investigate how force-      logic statements makes it possible to “reason about” the
field usage specifications and input topologies can be         annotated forcefield as a system of atom-typing rules, even
without applying them to molecular topologies. That is,         We propose to develop a validator tool that does just
in logic programming terms, we can reason about the             this this: takes an annotated forcefield as an input, and
rules without the facts. We cannot emphasize enough             iterates through a large number of correctly atom-typed
the importance of formal verification here. It may reveal       topologies ensuring that the atom-type annotations and
subtle and latent errors in the annotation logic of the         global invariants are never violated. If the validator finds
forcefield that would be only possible to detect through        a topology that is in contradiction with the forcefield
thorough testing by running the atom-typer on a large           annotations, it is the chemical scientist who needs to
swath of different topologies.                                  revisit and correct the forcefield, rather than a source
   Are there rules that are not decidable? Are there any        code issue.
annotations that will never evaluate to true, irrespective
                                                                3) Testing The Atom-Typer Tool
of the input chemical topology? Can it ever happen that
two rules may hold true at the same time, without one           The development and testing of the atom-typer tool is
overriding the other? Are there any atom-typing rules           much alike that of any software. Importantly, it can be
that are in contradiction with the global invariants? If        carried out without chemical domain expertise. This is
the answer to any of the above questions is yes, it             analogous to developing and testing a database server:
indicates an error in the logical structure of the forcefield   the software developers are not concerned with what
annotations. We propose to verify such properties on the        data, in what schemas, will the users store in the
system of forcefield annotations, and will provide a set        database, but rather focus on testing the functionality that
of development tools that help the forcefield developers        implements how queries are answered. In the proposed
pinpoint these inconsistencies early in the forcefield          effort, we will use state-of-the-art software testing tools,
development process. We expect that some of the above           including unit tests packaged into test suites, as well
violations will be detectable by the Datalog interpreter        as coverage tools to quantify the degree to which the
after the annotations are mapped to Datalog syntax,             source code is tested. The source code of the tools will
while others, pertaining to ensuring boolean satisfiability,    be stored on GitHub [31], leveraging its collaborative
will require us to integrate a theorem prover such as the       development, version control, and issue tracking facili-
widely used Z3 solver from Microsoft Research [30].             ties.
   Our proposed goal is to prove that the forcefield is         D. Continuous Integration
complete in the logical sense, that is, if the forcefield’s     Continuous integration (CI) is a software development
annotations evaluate to “true” on a correctly atom-typed        practice that encourages members of the development
topology, then the atom-typing tool is guaranteed to            team to integrate code into a shared repository several
compute the correct atom-typing, given the annotated            times a day. On each check-in, the software is au-
forcefield, the chemical species of the atoms and their         tomatically built and tested, allowing teams to detect
connectivity as inputs. We will strive to reach this goal,      problems early. Commonly, CI is provided as a cloud
even at the cost of reducing the expressiveness of the          service. Developers set up a project with a CI service
annotation language.                                            provider (e.g., Travis CI [32], CodeShip [33], etc.), and
2) Validation of Forcefield Annotations                         it is the CI service that watches the project’s source code
With validation, we want to answer the question: “Did           repository (e.g., GitHub) for changes, attempts to build
we come up with the correct annotations?” If we                 and test the code in virtual machines in the cloud, and
run the atom-typer on a molecular topology and it does          reports build and test results to the developers.
not produce the expected results, it is vital to know –             In the proposed effort, we will apply the CI approach
particularly for an interdisciplinary team of chemical          to the development process of both the annotated force-
scientists and computer engineers – whether the error           fields and the corresponding software tools (atom-typer,
is in the forcefield annotations or in the atom-typer’s         verifier, validator, etc.). For the software artifacts, CI
source code. With validation, we want to focus on               will be hosted at Travis CI. For the automated verifi-
errors in the forcefield annotations, without having to         cation and validation (V&V) of annotated forcefields,
worry about potential software bugs in the atom-typer’s         we propose to develop our own cloud service, based on
implementation.                                                 BuildBot [34], an open-source framework for automating
   Notice that if an annotated forcefield passes verifica-      software build, test, and release processes. The proposed
tion (i.e., it is complete in the logical sense), we can        forcefield CI service (see Fig. 2) will watch the repos-
validate it in an isolated way, without having to execute       itories where the annotated forcefield files are hosted.
the atom-typer tool. It is sufficient to check that the         Also, it will integrate with existing online repositories
forcefield’s annotation statements hold true on a set of        that host correctly atom-typed molecules, which will be
correctly atom-typed topologies (test cases), which will        automatically downloaded and used as the “ground truth”
entail, due to the completeness property, that the atom-        for forcefield validation. V&V of annotated force field
typer will be able to compute the correct atom-typing.          files will be triggered when either a.) changes in the
repository are detected or b.) new correctly-typed molecules are added.

Fig. 2. Continuous integration workflow of annotated forcefield development.

E. Incentivizing Community Involvement
Although there exist some meticulously well-maintained and “alive” forcefield repositories,
which are, not surprisingly, typically specific to a particular simulator tool or chemical
or biomolecular domain, cataloging the forcefield development of the past decades into a
coherent, unified, searchable online forcefield database would be an enormous undertaking,
which is not feasible without community involvement.
   The implementation of such an online database would not be technically challenging.
Neither would operating and maintaining such a service. However, convincing the developers
of new forcefields to upload their parameter sets, to correctly tag them with how and in
what context the values are applicable, to supply test cases, etc., would surely be a futile
attempt.
   It is notoriously hard to get outside users to upload their work to a repository. A novel
aspect of our community building approach is that we will incentivize our users to do so by
giving them free access to our continuous integration infrastructure: outside users will
register their own repositories with the forcefield CI service, which will provide
automatic, continuous verification, validation, and testing for their forcefields at no
cost.
   Whenever a new commit or pull request adds or modifies forcefield-related files in the
users’ registered repositories, the CI service will trigger the verification and validation
workflow. Once a forcefield (or a new version of a forcefield) passes all tests, the CI
service will publish it (that is, the annotated parameter set, its source URL, and the test
results) to our online forcefield repository, assigning to it a permanent URL and a digital
object identifier (DOI) for referencing and attribution purposes.

IV. Conclusion
Through the development of a new formalism for chemical context and a novel atom-typing
scheme, our approach unambiguously describes the appropriate usage of forcefield parameters
and helps to reduce atom-typing as a source of error during model development. Developing
this framework will simplify the rules needed for atom-typing, which is crucial as
forcefields continue to grow, specialize, and become more complex. The machine-readable
annotations of forcefield usage semantics will enable automation of tedious and error-prone
tasks, and will enable new application areas, ranging from automated forcefield comparison
and cross-validation, to complex simulation workflows integrating multiple forcefields and
simulator tools. Through offering our verification, validation, and testing infrastructure
as a free-of-charge continuous integration service to the community, we believe that our
approach has the potential to foster community involvement through the creation of an online
forcefield repository for the annotated forcefields that are tagged with DOIs for proper
referencing and attribution, the associated software, and documentation of our framework.

Acknowledgments
This work is supported by the National Science Foundation under grant number ACI 1535150.

References
[1] J. I. Siepmann, S. Karaborni, and B. Smit, “Simulating the critical behaviour of
    complex fluids,” Nature, vol. 365, pp. 330–332, 1993.
[2] S. Auer and D. Frenkel, “Prediction of absolute crystal-nucleation rate in hard-sphere
    colloids,” Nature, vol. 409, pp. 1020–1023, 2001.
[3] A. Haji-Akbari, M. Engel, A. S. Keys, X. Zheng, R. G. Petschek, P. Palffy-Muhoray, and
    S. C. Glotzer, “Disordered, quasicrystalline and crystalline phases of densely packed
    tetrahedra,” Nature, vol. 462, pp. 773–777, 2009.
[4] G. Feng and P. T. Cummings, “Supercapacitor capacitance exhibits oscillatory behavior as
    a function of nanopore size,” Journal of Physical Chemistry Letters, vol. 2,
    pp. 2859–2864, 2011.
[5] S. J. Weiner, P. A. Kollman, D. A. Case, U. C. Singh, C. Ghio, G. Alagona, S. Profeta,
    and P. Weiner, “A New Force Field for Molecular Mechanical Simulation of Nucleic Acids
    and Proteins,” Journal of the American Chemical Society, vol. 106, pp. 765–784, 1984.
[6] A. D. MacKerell, N. Banavali, and N. Foloppe, “Development and current status of the
    CHARMM force field for nucleic acids,” Biopolymers, vol. 56, pp. 257–265, 2000.
[7] W. L. Jorgensen, D. S. Maxwell, and J. Tirado-Rives, “Development and Testing of the
    OPLS All-Atom Force Field on Conformational Energetics and Properties of Organic
    Liquids,” Journal of the American Chemical Society, vol. 118, pp. 11225–11236,
    Jan. 1996.
[8] J. J. Potoff and J. I. Siepmann, “Vapor-liquid equilibria of mixtures containing
    alkanes, carbon dioxide, and nitrogen,” AIChE Journal, vol. 47, pp. 1676–1682, 2001.
[9] H. Sun, “COMPASS: An ab Initio Force-Field Optimized for Condensed-Phase Applications -
    Overview with Details on Alkane and Benzene Compounds,” The Journal of Physical
    Chemistry B, vol. 102, pp. 7338–7364, 1998.
[10] C. Oostenbrink, A. Villa, A. E. Mark, and W. F. Van Gunsteren, “A biomolecular force
     field based on the free enthalpy of hydration and solvation: The GROMOS force-field
     parameter sets 53A5 and 53A6,” Journal of Computational Chemistry, vol. 25,
     pp. 1656–1676, 2004.
[11] R. C. Rizzo, J. Tirado-Rives, and W. L. Jorgensen, “Estimation of binding affinities
     for HEPT and nevirapine analogues with HIV-1 reverse transcriptase via Monte Carlo
     simulations,” Journal of Medicinal Chemistry, vol. 44, pp. 145–154, 2001.
[12] M. G. Martin and J. I. Siepmann, “Transferable Potentials for Phase Equilibria. 1.
     United-Atom Description of n-Alkanes,” The Journal of Physical Chemistry B, vol. 102,
     pp. 2569–2577, 1998.
[13] E. K. Watkins and W. L. Jorgensen, “Perfluoroalkanes: Conformational Analysis and
     Liquid-State Properties from ab Initio and Monte Carlo Calculations,” The Journal of
     Physical Chemistry A, vol. 105, pp. 4118–4125, 2001.
[14] J. W. Ponder, “TINKER Molecular Modeling Software.”
[15] B. L. Bush and R. P. Sheridan, “PATTY: A Programmable Atom Typer and Language for
     Automatic Classification of Atoms in Molecular Databases,” Journal of Chemical
     Information and Computer Sciences, vol. 33, pp. 756–762, 1993.
[16] A. W. Schüttelkopf and D. M. F. Van Aalten, “PRODRG: A tool for high-throughput
     crystallography of protein-ligand complexes,” Acta Crystallographica Section D:
     Biological Crystallography, vol. 60, pp. 1355–1363, 2004.
[17] A. A. S. T. Ribeiro, B. A. C. Horta, and R. B. De Alencastro, “MKTOP: A program for
     automatic construction of molecular topologies,” Journal of the Brazilian Chemical
     Society, vol. 19, no. 7, pp. 1433–1435, 2008.
[18] A. K. Malde, L. Zuo, M. Breeze, M. Stroet, D. Poger, P. C. Nair, C. Oostenbrink, and
     A. E. Mark, “An Automated force field Topology Builder (ATB) and repository:
     Version 1.0,” Journal of Chemical Theory and Computation, vol. 7, pp. 4026–4037, 2011.
[19] K. Vanommeslaeghe and A. D. MacKerell, “Automation of the CHARMM General Force Field
     (CGenFF) I: bond perception and atom typing,” Journal of Chemical Information and
     Modeling, vol. 52, pp. 3144–3154, Dec. 2012.
[20] J. D. Yesselman, D. J. Price, J. L. Knight, and C. L. Brooks, “MATCH: an atom-typing
     toolset for molecular mechanics force fields,” Journal of Computational Chemistry,
     vol. 33, pp. 189–202, Jan. 2012.
[21] J. Wang, W. Wang, P. A. Kollman, and D. A. Case, “Automatic atom type and bond type
     perception in molecular mechanical calculations,” Journal of Molecular Graphics &
     Modelling, vol. 25, pp. 247–260, Oct. 2006.
[22] C. Klein, J. Sallai, C. R. Iacovella, C. McCabe, and P. T. Cummings, “mBuild: A
     Hierarchical, Component Based Molecule Builder,” in Poster Presentation: Computational
     Molecular Science and Engineering Forum, American Institute of Chemical Engineers
     Annual Meeting, Atlanta, GA, November 18, 2014.
[23] J. Sallai, G. Varga, S. Toth, C. Iacovella, C. Klein, C. McCabe, A. Ledeczi, and P. T.
     Cummings, “Web- and Cloud-based Software Infrastructure for Materials Design,” Procedia
     Computer Science, vol. 29, pp. 2034–2044, 2014.
[24] J. W. Lloyd, “Foundations of logic programming (2nd extended ed.),” Jan. 1987.
[25] W. Clocksin and C. S. Mellish, Programming in PROLOG. Springer Science & Business
     Media, 2003.
[26] S. Ceri, G. Gottlob, and L. Tanca, “What you always wanted to know about Datalog (and
     never dared to ask),” IEEE Transactions on Knowledge and Data Engineering, vol. 1,
     pp. 146–166, Mar. 1989.
[27] S. A. Cook, “The complexity of theorem-proving procedures,” in Proceedings of the Third
     Annual ACM Symposium on Theory of Computing, STOC ’71, (New York, NY, USA),
     pp. 151–158, ACM, 1971.
[28] D. Eppstein, “Subgraph isomorphism in planar graphs and related problems,” in
     Proceedings of the Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’95,
     (Philadelphia, PA, USA), pp. 632–640, Society for Industrial and Applied Mathematics,
     1995.
[29] Z. Sun, H. Wang, H. Wang, B. Shao, and J. Li, “Efficient subgraph matching on billion
     node graphs,” Proceedings of the VLDB Endowment, vol. 5, pp. 788–799, May 2012.
[30] L. De Moura and N. Bjørner, “Z3: An efficient SMT solver,” in Tools and Algorithms for
     the Construction and Analysis of Systems, pp. 337–340, Springer Berlin Heidelberg,
     2008.
[31] http://github.com, “GitHub: powerful collaboration, code review, and code management
     for open source and private projects.”
[32] http://travis-ci.org, “Travis CI: A hosted continuous integration service.”
[33] http://codeship.com, “Codeship: A free hosted Continuous Delivery Service.”
[34] http://buildbot.net, “Buildbot: An open-source framework for automating software build,
     test, and release processes.”