How can semantic annotations support the identification of network similarities? Christian Rosenke and Dagmar Waltemath Department of Computer Science, University of Rostock christian.rosenke|dagmar.waltemath@uni-rostock.de Abstract. Computational models in open model repositories support biologists in understanding and investigating biological questions. The availability of alternative models results in a need for model selection algorithms. This selection can be based on information retrieval search, full-text search, selection by ontology concepts etc. We here describe our approach to solving a serious aspect of model selection, that is, the problem of comparing models of reaction networks. Specifically, we dis- cuss how graph algorithmic approaches can help to compare models that are semantically enriched by annotations. While a graph comparison of naked models is infeasible, the knowledge gained from the semantic an- notations and domain specific structures can reduce the complexity. Our concept has the potential to improve model search, and it can contribute to the definition of similarity measures. 1 Introduction Modeling is a state-of-the-art method in systems biology research, and its impor- tance for experimental studies and result analyses is steadily increasing. Compu- tational models describe biological systems, which are analysed and often simu- lated, to gain further knowledge about a system under study, or to predict future experiments. Usually, models are provided to the research community in XML standard formats such as SBML [1] or CellML [2]. These XML-based encodings are annotated with terms from bio-ontologies [3] to enable model visualisation, comparison and search, and thereby improve model reuse. See Figure 1 for an example of an SBML encoded and annotated model on the cell cycle. Models are distributed via open repositories such as the BioModels Database [4], JWS Online [5], or the CellML model repository [6]. Understanding and further developing models remains a major challenge in biology today due to the ambiguity and varying levels of abstraction in model de- scriptions. Other aspects that complexify the study of models are the increasing number and size of models [7]. Even for well-annotated models, basic opera- tions, such as identifying suitable submodels and comparing models, can hardly be handled with or without computational support. In this work, we outline a path towards efficient computational methods that assist scientists in finding models and comparing them to each other. Many mod- els are described by reaction networks, which are built from nodes for reactants, translation ∅ 0006412 BRENDA cyclin GO cellular protein protein protein total dephospho- metabolic metabolic rylation process process cdc2 00070 0019538 0044267 Cyclin- dependent p-cyclin Kinase cdc2-P 2.7.11.22 protein phos- cellular protein p-cyclin phorylation protein modification cdc2 0006468 modification process process 0036211 Protein-serine 0006464 / threonine Kinase cdc2k-P 2.7.11 biological Transferase process 2 0065007 cdc2k Transferring regulation of regulation phosphorus- cyclin-dependent of protein containing p-cyclin protein ki- modification Groups nase activity process 2.7 0000079 00031399 ∅ Fig. 1. The cell cycle’s model by Tyson [8] from the BioModels Database http://www.ebi.ac.uk/biomodels-main/BIOMD0000000005 coming with semantic in- formation via annotation links to the ontologies BRENDA (blue) und GO (red). The biological system is described by reactions (green nodes) and chemical entities (yellow nodes) which are connected by directed arcs to express the roles of reactants, products, and modifiers. Ontology terms (blue and red nodes) are given together with their id and hierarchy in the respective ontology. reaction products and modifiers. The goal is to extend current search functional- ity by a model similarity measure that compares these network structures using graph algorithmic approaches. Similar works have already developed algorithms that calculate model sim- ilarities, as for instance in. Proposed solutions either use information retrieval to calculate similarities between models [7]; XML diff algorithms for difference detection within versions of a model [9]; focus on the similarities of semantic annotations of model entities [10]; or use network similarity approaches [11,12]. However, none of them fully integrates available, domain-specific information and graph-based approaches in one similarity measure. Moreover, existing algo- rithms provide heuristic or approximate solutions that do not yet represent real alternatives for the still common manual way of processing models. To master the demanding challenges introduced by the contemporary way of working with models, we incorporate knowledge about domain characteristics and so-called semantic annotations, which have so far not been considered together with the structural composition of networks. 2 Results This work is tailored towards models encoded in standard representation formats as SBML and CellML. The key concepts to counter the severe computational hardness of comparing “naked” graphs are to incorporate (1) information on structural constraints common to networks of biological models and (2) knowl- edge from semantic annotations. Incorporating the model structure can tremendously reduce the com- plexity of the proposed graph-based correlation procedures. This is because net- works in models for biological systems are subject to certain structural restric- tions simply by originating from real world chemical and physical processes. Chemical reactions, for instance, do almost never incorporate more than two or three reactants and products. Beside this general observation, more complex limits to biological reaction networks can be found. For example, certain pattern, called motifs [13], tend to appear regularly in many biological networks. Incorporating semantic annotations enhances or reduces the probability of matching nodes based on the linked ontology concepts. Many models are accurately annotated, including information about the biological meaning of model entities, about the roles that entities play in reactions, about the types of reactions, or about the nature of a parameter. We use these links to concepts in external ontologies to determine similarities between model entities. If two entities carry the same, or a similar annotation, and they structural restrictions apply, then our algorithm boosts the respective similarity value. 3 A First Graph-Based Approach using Annotations To support our argumentation in this work we present a first graph-based ap- proach where knowledge about semantic annotations is used to obtain an efficient solution. More precisely, we want to efficiently answer the question, if a query model Q is a submodel of another model M . Thus we look for an injective map- ping σ of the nodes from Q to those from M such that the edges are matched. Computationally, this is extremely difficult in general. But using the links into the ontologies, made available by the semantic annotations, can create valu- able connections between Q and M which essentially reduce the possible degree of freedom for the mapping σ. Our approach works, if nearly every node in Q has at most two possible candidate target nodes in M . Although this is a strong restriction, our algorithm is more general and powerful than many other approaches. Basically, we compute a formula in 2-cnf that describes the valid mappings σ by its satisfying assignments, that is, we reduce the problem to 2-SAT, a tractable version of the satisfiablity problem from proposal logic. For every node q from Q and every possible target node m of M we introduce a variable mq which is true if and only if σ(q) = m. As q has at most two targets m, m0 we are able to include the clauses (mq ∨ m0q ) and (¬mq ∨ ¬m0q ) to express that σ(q) is either m or m0 . To make σ injective, we have to state for every node m in M that at most one node from Q maps to m. For this end, we add the clause (¬mq ∨ ¬mq0 ) for every pair q, q 0 of Q-nodes that both possibly map to m. As σ has to preserve the adjacency of nodes, we add the clause (¬mq ∨ ¬m0q0 ) for all Q nodes q, q 0 with possible target nodes m, m0 if the adjacency between q and q 0 in Q is different from the adjacency between m and m0 in M . The length of the resulting 2-cnf formula is at most quadratical in the size of Q and widely independent of M . This easily makes queries Q of up to hundreds of nodes feasible. 4 Summary This work presents first thoughts on using a combination of algorithms for graph similarity and semantic annotations as a mean to map simulation models describ- ing biological systems. The concept will be implemented in a data management system that supports the management and linking of model files, ontologies used to semantically enrich the model files, and all data that needs to be accessed and stored during and after the computation of network similarities [7]. Our concept for model comparison can improve model search, and it can contribute to the definition of similarity measures. It can also be used to identify overlapping parts in a network, for example between different versions of a model [9]. References 1. Hucka et al.: The systems biology markup language (SBML): a medium for repre- sentation and exchange of biochemical network models.. Bioinformatics 19.4:524- 531, 2003. 2. Cuellar et al.: An overview of CellML 1.1, a biological model description language. Simulation 79.12, 2003. 3. Horridge et al.: The state of bio-medical ontologies. 2011. 4. Li et al.: BioModels Database: An enhanced, curated and annotated resource for published quantitative kinetic models. BMC Systems Biology, 2010. 5. Brett and Snoep: Web-based kinetic modelling using JWS Online Bioinformatics 20.13:2143-44, 2004 6. Lloyd et al.: The CellML Model Repository. Bioinformatics, 2008. 7. Henkel et al.: Ranked retrieval of computational biology models. BMC bioinfor- matics 11.1:423, 2010. 8. Tyson: Modeling the cell division cycle: cdc2 and cyclin interactions. Proc. Natl. Acad. Sci. 88(16):7328-7332, 1991. 9. Waltemath et al.: Improving the reuse of computational models through version control. Bioinformatics 29.6:742-748, 2013. 10. Schulz et al.: Propagating semantic information in biochemical network models. BMC Bioinformatics 13(1), 2012. 11. Pržulj: Biological network comparison using graphlet degree distribution. Bioin- formatics 2(23):177–183, 2007. 12. Gay et al.: A graphical method for reducing and relating models in systems biology. Bioinformatics 18(26):i575–i581, 2010. 13. Tyson and Novak: Functional motifs in biochemical reaction networks. Annual review of physical chemistry 61:219, 2010.