=Paper=
{{Paper
|id=Vol-1686/WSSSPE4_paper_35
|storemode=property
|title=Development of a software framework for formalizing forcefield atom-typing for molecular simulation
|pdfUrl=https://ceur-ws.org/Vol-1686/WSSSPE4_paper_35.pdf
|volume=Vol-1686
|authors=Janos Sallai,Christopher Iacovella,Christoph Klein,Tengyu Ma
}}
==Development of a software framework for formalizing forcefield atom-typing for molecular simulation==
Idea Paper: Development of a Software Framework for Formalizing Forcefield Atom-Typing for Molecular Simulation

Christopher R. Iacovella, Janos Sallai, Christoph Klein, Tengyu Ma

Vanderbilt University, Nashville, TN

{janos.sallai,christopher.r.iacovella,christoph.klein,tengyu.ma}@vanderbilt.edu

Abstract: Forcefields are a crucial ingredient of Molecular Dynamics (MD) simulations, describing the types and parameters of interactions between the simulated particles. These parameter sets, however, are typically specific to the molecule in which the atom appears, where within the molecule the atom is positioned, the phase or state point of the system, as well as the simulator tool in use. This makes choosing the correct parameter values a tedious and error-prone task. Forcefield parameters, furthermore, are often hard to locate: some are published in scientific papers, others come with MD tools, often with no or ambiguous documentation on their applicability. In this paper, we present a framework that aims to solve this data management issue, proposing a common format for forcefields that is self-documenting with machine-readable, declarative usage rules. We believe that processes and tools that are commonly used today in software development (e.g., unit testing, verification and validation, continuous integration, and version control) are, with proper infrastructure support, applicable to forcefield development as well. The paper describes how such an infrastructure can tackle managing and evolving forcefields by the MD community, and proposes a way to encourage and incentivize involvement by the stakeholders.

This work is licensed under a CC-BY-4.0 license.

===I. Introduction===

Molecular simulation plays a key role in understanding the atomistic and molecular level interactions that underlie many natural and man-made materials and processes [1]–[4]. Classical molecular simulations rely upon forcefields to describe the various interactions that exist between atoms and/or groups of atoms, including non-bonded interactions (van der Waals and electrostatics) and bonded interactions (bonds, angles, and torsions). These forcefields are typically expressed as a set of analytical functions with adjustable fitting parameters for different atomic/molecular species. Considerable efforts have been undertaken by many research groups to develop accurate forcefields, both in determining the mathematical functions and associated fitting parameters, for a large variety of molecular species under different conditions. Numerous forcefields (and associated parameters) have been devised, with acronyms such as AMBER [5], CHARMM [6], OPLS [7], SKS [1], TraPPE [8], COMPASS [9] and GROMOS [10]. The availability of these forcefields can significantly reduce or completely eliminate the difficult and costly task of determining the interactions between species, allowing researchers to instead focus their efforts on the motivating scientific questions. For example, the large number of parameters available for the OPLS forcefield has allowed for the automated screening of drug molecules in order to identify promising candidates for more effective treatments of HIV [11].

Fig. 1. Perfluorobutane, CF3-CF2-CF2-CF3, molecule shown with ball and stick representation.

However, while researchers do not necessarily need to spend time developing the forcefield parameters, determining which parameters to use (i.e., atom-typing) is still often a tedious and error-prone task. In many forcefields, the appropriate interaction parameters will depend on the local topological environment of atoms. For example, the non-bonded interactions of carbon atoms in perfluoroalkanes (PFA) are typically different for terminal carbons versus “middle” carbons [12]. Similarly, when identifying the correct parameters for a torsional term, one must typically consider not just the backbone (e.g., C-C-C-C), but also the bonds of each atom in the backbone (e.g., CH3-CH2-CH2-CH3 vs. CF3-CF2-CF2-CF3) [7], [13]. Consequently, a given forcefield may include multiple unique parameters for a given atom, whose usage depends on the chemical context of the atom. That is, the appropriate parameters will typically depend on a number of factors, including the specific molecule in which the atom appears (e.g., alkanes vs. perfluoroalkanes), or where in that molecule the atom appears (e.g., terminal vs. middle). As a clear illustration of this, we note that the OPLS all-atom forcefield parameter database (as provided in the TINKER molecular modeling software [14]) contains 427 different “types” of carbon atoms, which are all differentiated based on their chemical context. Furthermore, determining which of the multitude of forcefield parameters is most appropriate, based on the chemical context of an atom, is often accomplished through brief, unstructured, and sometimes ambiguous annotations located in the forcefield parameter files. Even journal articles associated with forcefield parameters can be unclear, where parameters are typically listed in table form, often with limited annotations or examples of their usage. Also, given their static nature, parameters provided in journal articles may not be the most up-to-date, whether resulting from typographic errors or modifications in later work. As a result, for many forcefields, the formal logic required to distinguish atom types may be difficult to find, difficult to interpret, out-of-date, or simply ambiguous. This ambiguity is compounded by the fact that many published journal articles that rely on forcefields often provide only vague citations for the source of the parameters.

Most research groups use some combination of hand parameterization and logic-based codes to facilitate the process of atom-typing. Identifying atom types by hand, while easy for small/simple molecules and important for validation, becomes impractical for large molecules or systems with significant heterogeneity. For molecules with only a few atomic species and little variability, ad hoc, logic-based codes are relatively straightforward to write and test. However, as the number of unique atom types increases, properly defining the appropriate if/else statements required in most logic codes becomes onerous, even for people who are well versed in programming and have domain expertise in molecular modeling. Also, nested logic statements are often difficult to adequately test and debug and thus introduce a potential source of error. Since ad hoc codes are not typically released to the general community, any errors in the code may go undiscovered and thus data based on flawed atom-typing may appear in published scientific literature. To address this, several community tools and approaches have been developed to aid in atom-typing [15]–[20], some of which are more generally applicable than others. Forcefields developed in the biophysics community tend to have exceptionally well vetted parameterization codes, such as AMBER’s antechamber [21], but such tools are typically designed to only work with their associated forcefield and may also produce output specific to a given simulation package, rather than a general form. In general, these tools all rely on a hierarchy where rules that identify more specialized atom types must be called in precise order [21], such that more general atom types are only chosen when more specialized matches do not exist (i.e., they include rule precedence). Maintaining, let alone constructing, these hierarchies is extremely error-prone and, just as in ad hoc codes, typically results in source code with deeply nested if/else statements. In these hierarchical schemes, in order to add a new atom type or correct an error, a developer must have a complete picture of the hierarchy and know exactly where the relevant rule should be placed, such that it does not inadvertently override other rules. This may impose practical limits on functionality, where, for example, a user is not able to easily extend the rules to include newer parameters, or where such attempts to extend rules result in incorrect atom-typing for other molecular species.

To address these issues, we are developing a new framework for atom-typing, based upon first-order logic over graph structures. The novelty of our approach lies in the declarative annotation syntax that allows for 1) decoupling the definition of forcefield-specific atom-typing rules from how a forcefield-agnostic tool uses these annotations to automatically compute the atom-typing of complex molecular systems, and 2) formally verifying and automatically validating annotated forcefields. Specifically, we propose to:

* establish a forcefield-agnostic (i.e., general) formalism to express the chemical context in which a particular forcefield entry is applicable (i.e., forcefield usage semantics);
* develop a tool suite that automates atom-typing of molecular structures using the semantics-annotated forcefields; and
* establish a development process to create, incrementally extend, and evolve annotated forcefields, providing tools for automated verification, validation, and continuous integration techniques.

An important goal of the project is to disseminate results and to foster community involvement through the creation of a centralized online forcefield repository containing the formally annotated forcefields, documentation of how to annotate forcefields, along with files for benchmarking and validation of the atom-typing software. Given the importance of accurately applying forcefields to molecular simulation, the new atom-typing framework developed in this work has the potential to significantly impact the community, affording researchers greater confidence in the model parameters used in their studies, eliminating the need to develop ad hoc atom-typing codes, making forcefield parameter usage clearer, and significantly reducing incorrect atom-typing as a source of error and inconsistency in published results.

===II. Vision===

In our vision, forcefield definitions are not just sets of tables with numerical data, but unambiguous, well-structured documents with rich metadata, including the formal description of the chemical context in which the particular parameter values can be applied. In the future, the tedious and error-prone task of manually atom-typing and parameterizing complex molecular models (inputs to simulators) will be eliminated, and replaced by an automated process that relies on the machine-readable forcefield usage semantics.

We envision forcefields as dynamic data entities that evolve over time: parameterizations of additional chemical species are added, already supported chemical species are specialized, and parameter values are tweaked for better modeling of the chemical interactions. Forcefield definitions will be maintained by online communities and hosted at shared repositories with version control capabilities. These online repositories will also support continuous integration (CI), i.e., automated verification, validation, and testing of the forcefields as they evolve.

We believe that the developers of forcefield definitions deserve credit for their efforts. The online repositories will generate permanent URL links per forcefield version, as well as digital object identifiers (DOIs), which allow unambiguously referencing these data artifacts and properly citing them in scientific publications.

===III. Approach===

The work to be carried out can be roughly broken down into two main efforts: (1) the development of the atom-typing framework, including the formalism to express forcefield usage semantics, and (2) examination of case studies designed to test, validate, and refine the atom-typing framework. We note that these efforts will be executed concurrently, to provide a continual loop of development, testing, and refinement.

====A. Annotation Syntax====

One of the main goals of the proposed effort is to design a domain specific language (DSL) for annotating forcefield parameters. The DSL will be used to express the chemical context in which the particular atom type is applicable. This DSL will, effectively, serve as a means to unambiguously document the forcefield. The syntax we propose will be expressive, unambiguous, and both human and machine readable.

As an example of the DSL, consider tagging the carbon atoms of perfluorobutane (shown in Fig. 1) with their usage semantics. Atom type 791 in TINKER’s OPLS-aa parameter database [14] corresponds to a terminal carbon of a perfluorobutane chain. This will be annotated with the following statement:

 C791 : type = C &
        count(bonded_atoms(type = F)) = 3 &
        count(bonded_atoms(type = C)) = 1                    (1)

This means that atom type C791 is a carbon (C) atom, which can be used in a chemical context where it has three bonded fluorines (F) and one bonded carbon. Similarly, atom type 792, applicable to carbon atoms that are part of a fluorocarbon backbone, would be annotated as

 C792 : type = C &
        count(bonded_atoms(type = C)) = 2 &
        count(bonded_atoms(type = F)) = 2                    (2)

Notice that the above annotations are logic statements consisting of predicates over the topology: type = C evaluates to true if and only if the chemical species of the atom is carbon (here, C is a built-in type); bonded_atoms() evaluates to the set of all atoms bonded to the one of interest, which can be further filtered by predicates. For instance, bonded_atoms(type != C) would evaluate to the set of all non-carbon bonded atoms. The proposed DSL will also support functions on sets, such as count(), and common set operations such as union, intersection, difference, etc. Also, the formalism will support the existential and universal quantifiers (exists() and for_all()) to allow evaluating first-order logic statements over sets. Furthermore, the language will include support, through built-in functions, to express common molecular structures that are too verbose to express otherwise, such as, for instance, rings of a particular size.

For convenience, we will allow the annotations to reference user-defined types in the language (not just the built-in chemical species, such as C and F). The statement

 C791 : type = C &
        count(bonded_atoms(type = F)) = 3 &
        count(bonded_atoms(type = C792)) = 1                 (3)

would specify that the carbon atom at the end of the fluorocarbon chain must have a C792 type neighbor, which is a carbon in the fluorocarbon backbone. This is clearly a more restrictive atom type usage specification than Eq. 1, allowing it to more specifically express the chemical context.

It is crucial that the annotation syntax we propose be future-proof and support the evolution of forcefields. Evolution can mean two things: a) the forcefield gets extended with support for new chemical species, or b) already supported species get more specialized. Annotating the atom types for the newly added chemical species can trivially be done incrementally. However, when an existing atom type is specialized, other existing atom-typing rules referencing the specialized one can also be affected. For instance, let us assume that we want to distinguish between C791-type fluorocarbon end groups based on what kind of carbon they are bonded to. Let us assume that to do this, we would remove the C791 atom type from the forcefield and replace it with C791A for a chemical context where the carbon neighbor is part of a fluorocarbon backbone (C792), and with C791B if it is not. Since the generic C791 type has been removed from the forcefield, all existing annotations that reference it need to be changed to reference C791A and/or C791B.

We want to avoid this, because it would make extending annotated forcefields a laborious and error-prone task, which would hinder the widespread acceptance of our proposed formalism, and would jeopardize the success of our efforts. To tackle this issue, the annotation language will allow multiple atom types to be assigned to a given atom in a molecule, but the atom type annotations will be required to explicitly express the specialization relations (i.e., that atom type C792A overrides C792). This way, if multiple atom types are applicable to a particular chemical context, it is the most specialized one that will get assigned to the given atom. The following example demonstrates a possible syntax to achieve that, by including the definition of a more generic perfluoroalkane carbon, CPFA.

 CPFA  : type = C &
         count(bonded_atoms()) = 4 &
         for_all(bonded_atoms(), type = C || type = F)
 
 C791  : type = CPFA &
         count(bonded_atoms(type = F)) = 3 &
         count(bonded_atoms(type = C)) = 1
         @overrides(CPFA)
 
 C791A : type = C791 &
         count(bonded_atoms(type = C792)) = 1
         @overrides(C791)
 
 C791B : type = C791 &
         !exists(bonded_atoms(), type = C792)
         @overrides(C791)                                    (4)

Support for overriding annotations is important for two reasons. First, it allows for incremental development of forcefields, and second, the overridden, more general atom types can be used as wildcards in references. Consider the following annotation of a torsional term defined over four carbon atoms on a fluorocarbon backbone (bonds and angles are annotated similarly):

 C13_C13_C13_C13PFA : (CPFA, C792, C792, CPFA)              (5)

This annotation defines that the torsional term C13_C13_C13_C13PFA is applicable to a series of four carbon atoms such that the first and the last one can be any kind of carbon in a fluorocarbon molecule, but the two middle atoms must be of a more specific type, C792, with exactly two carbon neighbors. This allows us to uniquely identify bonded interactions for atom types that are of the same “atom class.” The DSL can also be extended to provide other parameters that may be required, for example, a molecular length option to uniquely identify instances when a specific torsion parameter should be used.

Apart from annotating atom types and various bonded interaction types (bonds, angles, torsions) in the forcefield, the proposed language supports global invariants as well. These invariants can be used to express constraints on the applicability of the forcefield as a whole, and are essential to prevent forcefield use on unsupported topologies. For example, if the forcefield cannot handle molecules with ring structures, a global invariant can state that:

 for_all(type = C, !in_ring())                               (6)

We note that as part of defining the DSL for annotating the forcefield, we will develop various “helper” tools to ensure proper syntax usage.

====B. The Atom-Typing Tool====

Our approach differs from existing atom-typing tools in a number of ways. First, existing tools are forcefield-specific, while the proposed atom-typer is forcefield-agnostic. Second, the common practice is to hard-code the atom-typing logic into the tool’s source code, which makes the code hard to extend as the forcefield evolves. The formalism we propose factors out the atom-typing logic into declarative annotations, which, instead of describing how atom-typing rules are executed, state the invariants of the chemical context that must hold true for a given atom type or parameters. The proposed atom-typer will interpret this declarative formalism, and compute the atom type assignments that satisfy the invariants. Third, existing tools often do not cover all chemical contexts supported by the forcefield, but do produce some (potentially bogus) output when run on topologies that include such contexts. Our tool relies on a three-pronged approach to alleviate such problems: 1) formal verification of forcefield annotations that reveals omissions and contradictions, 2) global invariants in the forcefield that map to assertions on the input topologies, warning the user of unsupported molecular features, and 3) a proposed test suite and continuous integration setup that validates the annotated forcefields against known, correctly atom-typed topologies.

An important goal of the proposed effort is to leverage the annotated forcefields to automatically atom-type complex molecular systems. The description of the molecular system must include at least the chemical species of the atoms (element names or atomic numbers), and their connectivity (bonds), both of which could be provided, for instance, from our previously developed mBuild tool [22], [23]. The atom-typer tool we propose to develop will read this topology, along with an annotated forcefield specification, and will produce an output topology with the forcefield-specific atom types, as well as bonded interaction types (bonds, angles, torsions, etc.) and associated parameters, which can be used as input to a molecular simulator.

For the atom-typer tool, we will investigate the following implementation approaches:

Naive approach. A naive algorithm first orders the atom type annotations according to the “referenced-by” relationship, that is, if annotation C791 references annotation C792, then C792 will precede C791 in the order. Then, following this ordering, the algorithm evaluates all atom-type annotations for all atoms in the input topology. This ensures that all referenced atom-typing rules are evaluated before those that reference them. If multiple annotations evaluate to true for a given atom in the input topology (e.g., CPFA, C791, and C791A), the most specialized type (C791A) will be chosen, following the @overrides relations of the annotations. Obviously, this naive approach would fail if no partial ordering of atom type annotations exists, i.e., if there are annotations that mutually reference each other, or if circular dependencies exist.

Fixpoint-based method. Allowing recursion in annotations would make the language more expressive. But in order to accommodate recursive annotations, we need to tweak the naive approach. First, the fixpoint-based solver evaluates the primitive rules, i.e., the ones that do not reference others, for every atom in the topology. If an annotation evaluates to true (on any atom), it enables those annotations that reference it. When the enabled annotation is then evaluated, it, in turn, may enable or re-enable others. This continues until no further annotations are enabled (that is, a fixpoint is reached), which is the termination condition of the iteration.

Logic programming. Logic programming languages [24], such as Prolog [25] or Datalog [26], that have been designed for deductive reasoning, are particularly useful in this context. Logic programs consist of facts, i.e., things that always hold true (e.g., “Mickey is a mouse”, “Pluto is a dog”, “Mars is a planet”), and deduction rules that can be used to define relations (e.g., “a mouse is an animal”, “a dog is an animal”). The program is then run by the user posting queries (e.g., “list all animals”), which the interpreter runtime executes and evaluates (returning Mickey and Pluto in this example). By representing the “bonded-to” relations of the input topology as tuples, and evaluating certain non-logic functions in advance (e.g., enumerating ring structures), we can encode them as facts in a logic program (e.g., “atom1 is a carbon”, “atom2 is a hydrogen”, “atom1 is part of ring1”, “atom1 is bonded to atom2”, etc.). Similarly, the annotations of the forcefield’s atom types will be mapped to deduction rules (e.g., “a carbon atom with 4 bonded hydrogens is a methane carbon”). Then, the logic program can be run by executing queries to list atoms with each atom type (e.g., “list all methane carbons”).

An important difference between Prolog and Datalog is that while Prolog is an expressive, general purpose logic programming language, Datalog is not Turing-complete, and mostly focuses on reasoning about data. Also, Datalog imposes restrictions on the use of negation and recursion (Prolog does not); however, Datalog queries are always guaranteed to terminate (while in Prolog, there is no such guarantee).

In the proposed effort, we will investigate how forcefield usage specifications and input topologies can be mapped to Datalog, and what the performance implications are, including how Datalog query execution scales with the topology size and the number of atom types in the forcefield.

Subgraph isomorphism. We expect that many (but not all) of the forcefield annotations will, implicitly, describe the chemical context as a subgraph. While the subgraph isomorphism problem, in the general case, is NP-complete [27], this is not the case with the problem of finding where a particular chemical neighborhood is present in the overall topology [28]. This is due to the fact that chemical topologies belong to a special class of graphs (planar graphs) where the maximum number of bonded atoms is well defined (i.e., never more than 4 for covalently bonded systems), and that the chemical neighborhoods in question tend to be relatively small. State-of-the-art graph matching solutions that employ a plethora of optimization techniques and exploit parallelism can today scale up to graphs with 10^9 nodes [29].

We suspect that for certain kinds of annotations, especially those involving ring structures, a subgraph matching based approach would provide better performance than directly mapping such atom type annotations to, for instance, Datalog rules. Therefore, we envision that the final version of the atom-typing tool will borrow from several of the above mentioned approaches.

It is important to note that the declarative nature of the proposed annotation language allows us to decouple the forcefield specification (what statements must hold true for a correctly atom-typed topology) from execution (how it is achieved). That is, the particular execution approach we will eventually choose is orthogonal to the forcefield annotation syntax, so the implementation of the atom-typer tool can evolve independently of the forcefield annotations.

====C. Automated Verification, Validation, and Testing====

Annotating a forcefield with atom-typing semantics is a major undertaking, and it is inevitable that we make errors on the way. The same is true for a complex piece of software, such as the atom-typer tool. We believe that through proper testing, verification, and validation, we can build quality forcefield annotations and tools that either produce the correct atom-typing results, or fail with adequate warnings or error messages. Our approach is unique in that it provides two very distinct types of testing: 1) verification of the underlying rules to detect inconsistencies and 2) validation of the outputted molecular models.

=====1) Verification=====

With verification, we want to answer the question: “Are we annotating the forcefield correctly?” The fact that our proposed annotation syntax can be mapped to first-order logic statements makes it possible to “reason about” the annotated forcefield as a system of atom-typing rules, even without applying them to molecular topologies. That is, in logic programming terms, we can reason about the rules without the facts. We cannot emphasize enough the importance of formal verification here. It may reveal subtle and latent errors in the annotation logic of the forcefield that would otherwise only be possible to detect through thorough testing by running the atom-typer on a large swath of different topologies.

Are there rules that are not decidable? Are there any annotations that will never evaluate to true, irrespective of the input chemical topology? Can it ever happen that two rules may hold true at the same time, without one overriding the other? Are there any atom-typing rules that are in contradiction with the global invariants? If the answer to any of the above questions is yes, it indicates an error in the logical structure of the forcefield annotations. We propose to verify such properties on the system of forcefield annotations, and will provide a set of development tools that help the forcefield developers pinpoint these inconsistencies early in the forcefield development process. We expect that some of the above violations will be detectable by the Datalog interpreter after the annotations are mapped to Datalog syntax, while others, pertaining to ensuring boolean satisfiability, will require us to integrate a theorem prover such as the widely used Z3 solver from Microsoft Research [30].

Our proposed goal is to prove that the forcefield is complete in the logical sense, that is, if the forcefield’s annotations evaluate to “true” on a correctly atom-typed topology, then the atom-typing tool is guaranteed to compute the correct atom-typing, given the annotated forcefield, the chemical species of the atoms and their connectivity as inputs. We will strive to reach this goal, even at the cost of reducing the expressiveness of the annotation language.

=====2) Validation of Forcefield Annotations=====

With validation, we want to answer the question: “Did we come up with the correct annotations?” If we run the atom-typer on a molecular topology and it does not produce the expected results, it is vital to know (particularly for an interdisciplinary team of chemical scientists and computer engineers) whether the error is in the forcefield annotations or in the atom-typer’s source code. With validation, we want to focus on errors in the forcefield annotations, without having to worry about potential software bugs in the atom-typer’s implementation.

Notice that if an annotated forcefield passes verification (i.e., it is complete in the logical sense), we can validate it in an isolated way, without having to execute the atom-typer tool. It is sufficient to check that the forcefield’s annotation statements hold true on a set of correctly atom-typed topologies (test cases), which will entail, due to the completeness property, that the atom-typer will be able to compute the correct atom-typing. We propose to develop a validator tool that does just this: it takes an annotated forcefield as an input, and iterates through a large number of correctly atom-typed topologies, ensuring that the atom-type annotations and global invariants are never violated. If the validator finds a topology that is in contradiction with the forcefield annotations, the problem lies with the forcefield annotations, which the chemical scientist needs to revisit and correct, rather than with the source code.

=====3) Testing the Atom-Typer Tool=====

The development and testing of the atom-typer tool is much like that of any software. Importantly, it can be carried out without chemical domain expertise. This is analogous to developing and testing a database server: the software developers are not concerned with what data, in what schemas, the users will store in the database, but rather focus on testing the functionality that implements how queries are answered. In the proposed effort, we will use state-of-the-art software testing tools, including unit tests packaged into test suites, as well as coverage tools to quantify the degree to which the source code is tested. The source code of the tools will be stored on GitHub [31], leveraging its collaborative development, version control, and issue tracking facilities.

====D. Continuous Integration====

Continuous integration (CI) is a software development practice that encourages members of the development team to integrate code into a shared repository several times a day. On each check-in, the software is automatically built and tested, allowing teams to detect problems early. Commonly, CI is provided as a cloud service. Developers set up a project with a CI service provider (e.g., Travis CI [32], CodeShip [33], etc.), and it is the CI service that watches the project’s source code repository (e.g., GitHub) for changes, attempts to build and test the code in virtual machines in the cloud, and reports build and test results to the developers.

In the proposed effort, we will apply the CI approach to the development process of both the annotated forcefields and the corresponding software tools (atom-typer, verifier, validator, etc.). For the software artifacts, CI will be hosted at Travis CI. For the automated verification and validation (V&V) of annotated forcefields, we propose to develop our own cloud service, based on BuildBot [34], an open-source framework for automating software build, test, and release processes. The proposed forcefield CI service (see Fig. 2) will watch the repositories where the annotated forcefield files are hosted. Also, it will integrate with existing online repositories that host correctly atom-typed molecules, which will be automatically downloaded and used as the “ground truth” for forcefield validation. V&V of annotated forcefield files will be triggered when either a) changes in the repository are detected, or b) new correctly-typed molecules are added.

Fig. 2. Continuous integration workflow of annotated forcefield development.

====E. Incentivizing Community Involvement====

Although there exist some meticulously well maintained and “alive” forcefield repositories that are, not surprisingly, typically specific to a particular simulator tool or chemical or biomolecular domain, cataloging forcefield development of the past decades into a coherent, unified, searchable online forcefield database would be an enormous undertaking, which is not feasible without community involvement.

The implementation of such an online database would not be technically challenging. Neither would be operating and maintaining such a service. However, convincing the developers of new forcefields to upload their parameter sets, to correctly tag them with how and in what context the values are applicable, to supply test cases, etc., would surely be a futile attempt.

It is well known that it is notoriously hard to get outside users to upload their work to a repository. A novel aspect of our community building approach is that we will incentivize our users to do so by giving them free access to our continuous integration infrastructure: outside users will register their own repositories with the forcefield CI service, which will provide automatic, continuous verification, validation, and testing for their forcefields at no cost.

Whenever a new commit or pull request adds or modifies forcefield related files in the users’ registered repositories, the CI service will trigger the verification and validation workflow. Once a forcefield (or a new version of a forcefield) passes all tests, the CI service

specialize, and become more complex. The machine-readable annotations of forcefield usage semantics will enable automation of tedious and error-prone tasks, and will enable new application areas, ranging from automated forcefield comparison and cross-validation, to complex simulation workflows integrating multiple forcefields and simulator tools. Through offering our verification, validation, and testing infrastructure as a free-of-charge continuous integration service to the community, we believe that our approach has the potential to foster community involvement through the creation of an online forcefield repository for the annotated forcefields that are tagged with DOIs for proper referencing and attribution, the associated software, and documentation of our framework.

===Acknowledgments===

This work is supported by the National Science Foundation under grant number ACI 1535150.

===References===

* [1] J. I. Siepmann, S. Karaborni, and B. Smit, “Simulating the critical behaviour of complex fluids,” Nature, vol. 365, pp. 330–332, 1993.
* [2] S. Auer and D. Frenkel, “Prediction of absolute crystal-nucleation rate in hard-sphere colloids,” Nature, vol. 409, pp. 1020–1023, 2001.
* [3] A. Haji-Akbari, M. Engel, A. S. Keys, X. Zheng, R. G. Petschek, P. Palffy-Muhoray, and S. C. Glotzer, “Disordered, quasicrystalline and crystalline phases of densely packed tetrahedra,” Nature, vol. 462, pp. 773–777, 2009.
* [4] G. Feng and P. T. Cummings, “Supercapacitor capacitance exhibits oscillatory behavior as a function of nanopore size,” Journal of Physical Chemistry Letters, vol. 2, pp. 2859–2864, 2011.
* [5] S. J. Weiner, P. A. Kollman, D. A. Case, U. C. Singh, C. Ghio, G. Alagona, S. 
Profeta, and P. Weinerl, “A New Force Field for will publish it (that is, the annotated parameter set, Molecular Mechanical Simulation of Nucleic Acids and Proteins,” Journal of the American Chemical Society, vol. 106, its source URL, and the test results) to our online pp. 765–784, 1984. forcefield repository, assigning to it a permanent URL [6] A. D. MacKerell, N. Banavali, and N. Foloppe, “Development and a document object identifier (DOI) for referencing and current status of the CHARMM force field for nucleic acids,” Biopolymers, vol. 56, pp. 257–265, 2000. and attribution purposes. [7] W. L. Jorgensen, D. S. Maxwell, and J. Tirado-Rives, “Development and Testing of the OPLS All-Atom Force Field IV. Conclusion on Conformational Energetics and Properties of Organic Through the development of a new formalism for chemi- Liquids,” Journal of the American Chemical Society, vol. 118, cal context and novel atom-typing scheme, our approach pp. 11225–11236, Jan. 1996. [8] J. J. Potoff and J. I. Siepmann, “Vapor-liquid equilibria of unambiguously describes the appropriate usage of force- mixtures containing alkanes, carbon dioxide, and nitrogen,” field parameters and helps to reduce atom-typing as a AIChE Journal, vol. 47, pp. 1676–1682, 2001. source of error during model development. Developing [9] H. Sun, “COMPASS: An ab Initio Force-Field Optimized for Condensed-Phase Applications s Overview with Details on this framework will simplify the rules needed for atom- Alkane and Benzene Compounds,” Journal of Physical typing, which is crucial as forcefields continue to grow, Chemistry, vol. 5647, pp. 7338–7364, 1998. [10] C. Oostenbrink, A. Villa, A. E. Mark, and W. F. Van (Philadelphia, PA, USA), pp. 632–640, Society for Industrial Gunsteren, “A biomolecular force field based on the free and Applied Mathematics, 1995. enthalpy of hydration and solvation: The GROMOS force-field [29] Z. Sun, H. Wang, H. Wang, B. Shao, and J. 
Li, “Efficient parameter sets 53A5 and 53A6,” Journal of Computational subgraph matching on billion node graphs,” Proceedings of the Chemistry, vol. 25, pp. 1656–1676, 2004. VLDB Endowment, vol. 5, pp. 788–799, May 2012. [11] R. C. Rizzo, J. Tirado-Rives, and W. L. Jorgensen, “Estimation [30] L. De Moura and N. Bjø rner, “Z3: An efficient SMT solver,” of binding affinities for HEPT and nevirapine analogues with in Tools and Algorithms for the Construction and Analysis of HIV-1 reverse transcriptase via Monte Carlo simulations,” Systems, pp. 337–340, Springer Berlin Heidelberg, 2008. Journal of Medicinal Chemistry, vol. 44, pp. 145–154, 2001. [31] Http://github.com, “GitHub: powerful collaboration, code [12] M. G. Martin and J. I. Siepmann, “Transferable Potentials for review, and code management for open source and private Phase Equilibria. 1. United-Atom Description of n -Alkanes,” projects.” The Journal of Physical Chemistry B, vol. 102, no. 97, [32] Http://travis-ci.org, “Travis CI: A hosted continuous integration pp. 2569–2577, 1998. service.” [13] E. K. Watkins and W. L. Jorgensen, “Perfluoroalkanes: [33] Http://codeship.com, “Codeship: A free hosted Continuous Conformational Analysis and Liquid-State Properties from ab Delivery Service.” Initio and Monte Carlo Calculations,” The Journal of Physical [34] Http://buildbot.net, “Buildbot: An open-source framework for Chemistry A, vol. 105, pp. 4118–4125, 2001. automating software build, test, and release processes..” [14] J. W. Ponder, “TINKER Molecular Modeling Software.” [15] B. L. Bush and R. P. Sheridan, “PATTY: A Programmable Atom Typer and Language for Automatic Classification of Atoms in Molecular Databases,” Journal of Chemical Information and Computer Sciences, vol. 33, pp. 756–762, 1993. [16] A. W. Schüttelkopf and D. M. F. Van Aalten, “PRODRG: A tool for high-throughput crystallography of protein-ligand complexes,” Acta Crystallographica Section D: Biological Crystallography, vol. 60, pp. 
1355–1363, 2004. [17] A. A. S. T. Ribeiro, B. A. C. Horta, and R. B. De Alencastro, “MKTOP: A program for automatic construction of molecular topologies,” Journal of the Brazilian Chemical Society, vol. 19, no. 7, pp. 1433–1435, 2008. [18] A. K. Malde, L. Zuo, M. Breeze, M. Stroet, D. Poger, P. C. Nair, C. Oostenbrink, and A. E. Mark, “An Automated force field Topology Builder (ATB) and repository: Version 1.0,” Journal of Chemical Theory and Computation, vol. 7, pp. 4026–4037, 2011. [19] K. Vanommeslaeghe and a. D. MacKerell, “Automation of the CHARMM General Force Field (CGenFF) I: bond perception and atom typing.,” Journal of Chemical Information and Modeling, vol. 52, pp. 3144–54, Dec. 2012. [20] J. D. Yesselman, D. J. Price, J. L. Knight, and C. L. Brooks, “MATCH: an atom-typing toolset for molecular mechanics force fields.,” Journal of Computational Chemistry, vol. 33, pp. 189–202, Jan. 2012. [21] J. Wang, W. Wang, P. A. Kollman, and D. A. Case, “Automatic atom type and bond type perception in molecular mechanical calculations.,” Journal of Molecular Graphics & Modelling, vol. 25, pp. 247–60, Oct. 2006. [22] C. Klein, J. Sallai, C. R. Iacovella, C. McCabe, and P. T. Cummings, “Mbuild: A Hierarchical, Component Based Molecule Builder,” in Poster Presentation: Computational Molecular Science and Engineering Forum, American Institute of Chemical Engineers Annual Meeting, Atlanta, GA, November 18, 2014. [23] J. Sallai, G. Varga, S. Toth, C. Iacovella, C. Klein, C. McCabe, A. Ledeczi, and P. T. Cummings, “Web- and Cloud-based Software Infrastructure for Materials Design,” Procedia Computer Science, vol. 29, pp. 2034–2044, 2014. [24] J. W. Lloyd, “Foundations of logic programming; (2nd extended ed.),” Jan. 1987. [25] W. Clocksin and C. S. Mellish, Programming in PROLOG. Springer Science & Business Media, 2003. [26] S. Ceri, G. Gottlob, and L. 
Tanca, “What you always wanted to know about Datalog (and never dared to ask),” IEEE Transactions on Knowledge and Data Engineering, vol. 1, pp. 146–166, Mar. 1989. [27] S. A. Cook, “The complexity of theorem-proving procedures,” in Proceedings of the Third Annual ACM Symposium on Theory of Computing, STOC ’71, (New York, NY, USA), pp. 151–158, ACM, 1971. [28] D. Eppstein, “Subgraph isomorphism in planar graphs and related problems,” in Proceedings of the Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’95,