Looking into Reactome through Biopax Lens

Laleh Kazemzadeh∗, Helena Deus∗, Michel Dumontier† and Frank Barry‡

∗ Digital Enterprise Research Institute, National University of Ireland, Galway. Email: laleh.kazemzadeh@deri.org
† Department of Biology, Ottawa Institute of Systems Biology, Ottawa, Canada. Email: michel_dumontier@carleton.ca
‡ National Centre for Biomedical Engineering Science, National University of Ireland, Galway. Email: frank.barry@nuigalway.ie

Abstract: In order to understand cell behavior under different conditions, the computational simulation of biological pathways is of great interest. Simulating a biological pathway computationally requires extensive knowledge of the protein-protein interactions (PPIs) in the pathway, along with information about the generic flow of the pathway components, i.e. the biological reactions that comprise the pathway. The popularity of Semantic Web technologies in tackling integrative bioinformatics challenges has increased, with various approaches used to aggregate and correlate data from different sources. However, the integration of publicly available pathway databases, to determine the different PPIs and hence effectively simulate cell behavior, still faces various obstacles. In this paper, we present a semantic approach to the pathway-wise analysis of PPIs using the Biopax standard, focusing particularly on the Reactome database. We have identified the PPIs involved in a given pathway by the hierarchical extraction of its components (complexes, proteins, small molecules). We have developed a visualization tool which automatically generates a visual representation of the directed graph of PPIs in any specified pathway. Our approach provides intuitive inference over the data by flattening the nested pathways in Reactome and their components instead of wrapping each layer of data in the shell of an outer pathway. We also discuss how the representation of a pathway in the Biopax standard format is highly complex and even contains redundant information. Hence tools are needed to facilitate the navigation and analysis of pathway datasets that have been structured in Biopax format.

I. INTRODUCTION

The functionality of the human body is tightly regulated by biological pathways. The basic building blocks of these pathways are proteins, which act in concert to keep the regulation of pathways intact. Therefore, understanding the dynamics of these pathways depends directly on understanding how the proteins involved in a pathway interact with each other. Interactions between two proteins may be of different types, e.g. activation, inhibition, and methylation. Analyzing biological data from a pathway perspective can yield valuable information about the process of disease and suggest new drug discovery methods that target mis-regulation in specific pathways, thus enabling a much more precise targeting of diseases. However, computationally representing a pathway is not a trivial exercise due to the various types of components and interactions; regulation of pathways requires a cascade of events and interactions between genes, proteins and small molecules.

In addition, there is significant cross-talk between pathways, which highlights the fact that pathways are not isolated but are made up of a network of components. As such, treating them as a system, as opposed to an enclosed and self-contained pathway, can support a more realistic investigation.

II. STATE OF THE ART

A large number of tools and applications, vocabularies and ontologies aimed at computationally modeling biological pathways currently exist with enough precision to enable realistic simulations of their processes and determination of the mechanism of action of various molecular compounds; examples include the Systems Biology Markup Language (SBML) [1] and the Proteomics Standards Initiative-Molecular Interaction format (PSI-MI, http://www.psidev.info). These models and data formats are also devised to deepen and broaden our understanding of pathways. A few models also keep track of semantics, i.e. they attempt to describe each compound and each interaction precisely and unambiguously, such that they can be interpreted by applications and thus be integrated with other models. Biological Pathway Exchange (Biopax) [2] is one such data format. Biopax is a standard format for representing pathways and molecular interactions within and between pathways, developed with the aim of facilitating the process of collecting, indexing and sharing data [2]. Several databases hosting pathway and protein interaction information, such as Reactome (http://www.reactome.org) and Pathway Commons [3], are already available in this format. Information retrieved from expert-curated databases like Reactome is highly valuable for scientific advancement, since these databases provide the most accurate training data sets. However, because they rely on human curation, they suffer from limited coverage in the number of interactions available. Integrating such data warehouses in one standard format will improve the coverage and highlight the role of Biopax in standardization. There is enormous potential in using the information represented in Biopax format to realistically address biological questions, for example, the metabolic effects of a compound in the cell or how certain alterations in the metabolic network can be the root cause of diseases or drug resistance.

The discovery and confirmation of a biologically meaningful molecular interaction often requires the analysis of an enormous amount of heterogeneous data, which is typically deposited in local databases that are isolated from each other. Therefore, a considerable number of molecular interactions are "hidden" in this data, and can only be exposed once these results are integrated and the recurrence of patterns indicative of interactions is analyzed. The data integration challenges in life science have motivated researchers to adopt the new integration technologies offered by the Semantic Web and Linked Data. Semantic Web technologies can provide a bridge between datasets, enabling the discovery of links which are often not obvious. These bridges are often standard vocabularies and ontologies developed toward improvements in knowledge discovery, which lead to the next challenge: the representation, application and acceptance of these standard vocabularies by the domain experts.

The motivational scenario for the work presented here is the extraction of all the molecular components that act in a particular biological process as described by Biopax in its various data sources. We have chosen Biopax firstly because it has been adopted by several databases that provide information on signalling pathways, and secondly because it facilitates data integration from other sources containing protein information.

Biopax has been developed to capture various aspects of signalling, regulatory and metabolic pathways. However, in order to provide a descriptive solution and to cover all details in the description of pathways, some complexity needed to be introduced. In Biopax, each pathway is constructed in the form of nested pathways, which partially, but not fully, illustrate the overlaps between several pathways. Furthermore, each biochemical reaction is described as a function of the "left" and "right" hand sides of the stoichiometric equation. Fig. 1 illustrates an example of data complexity and redundancy in representing the biochemical reactions involved in pathway336 (caspase-mediated cleavage of cytoskeletal proteins). As mentioned above, each biochemical reaction has left and right components, each of which refers to a unique and separate sequence participant. However, each of these sequence participants points to the same protein ID from the UniProt database. In other words, both the left and the right of a given biochemical reaction point to the same protein, which increases the redundancy of the data.

Fig. 1: Example of redundancy and incompleteness of data represented in Biopax level 2, taken from the caspase-mediated cleavage of cytoskeletal proteins pathway. The blue box indicates the sample pathway, orange boxes represent the list of biochemical reactions associated with this pathway, green boxes show the sequence participants at the left and right of each biochemical reaction, and red boxes depict the unique UniProt ID of the protein to which each left and right of a biochemical reaction points.

The aim of our work is to devise a tool that aggregates information from this data, e.g. the protein interactions and the components of protein complexes in pathways. This will allow us to easily identify common interactions between various components (proteins, complexes, etc.) across pathways, abstracting from the complexity of pathway representation in Biopax. The data analysis tools made available by Reactome are unable to provide this inner-pathway analysis unless pathways are nested or siblings.

III. METHODS

One typical way of querying a pathway or an interaction between two proteins from different online databases is through browsing their webpages. As easy as it seems, it is time consuming and cumbersome to go through all the available databases manually. Instead, we can query the PPIs directly from the raw data provided by databases like Reactome and other such pathway databases. We propose an approach to overcome such problems, which is explained below.

Fig. 2 shows an overall view of the steps taken in our approach to identify the protein-protein interactions pathway-wise. We downloaded the protein-protein interaction file for Homo sapiens from the Reactome webpage in Biopax format. This data was uploaded to our Sesame server (http://hcls.deri.org:8080/openrdf-workbench/repositories/) in the form of triples. The Aggregator module has been developed to extract the components involved in a pathway and break down the pathway to the level of complexes, proteins and molecules.

Fig. 2: Overall view of the proposed method.

The system provides a list of selectable pathways compatible with the pathway names used in Reactome. The ID of the selected pathway, e.g. Apoptosis or Programmed Cell Death (PCD), is retrieved from the triple store by the ID Retrieval module. The Pathway Step Retrieval module retrieves the list of inner pathways (pathway steps) forming the selected pathway. Each of these pathways is segregated hierarchically in the Extraction module.

The data extracted from a pathway step contains a bundle of relational information describing the reactions, complex blocks, proteins and small molecules forming complexes. In the final stage, the Network Generator constructs a model from the data extracted in the previous step. This model is then fed to the network visualizer, which renders and displays the relational graph between the components of the pathway. In this model, the relation between each entity, complex, protein and molecule in the pathway is illustrated in a directed graph, where nodes represent the entities, pathways, proteins and molecules, and edges represent the connections between source and target nodes, or between the higher level and lower level components in a pathway tree.

The interaction Aggregator is written in PHP and uses the ARC2 package (https://github.com/semsol/arc2/wiki) to query the Reactome triples. The force-directed graph is generated by the Data Driven Documents library (d3, http://d3js.org/), written in JavaScript.

IV. RESULTS

The raw material for our approach is an input .owl file containing the description of a pathway in Biopax. Applying our method, we were able to generate a pathway-wise PPI network, which is shown and discussed below.

Fig. 3 shows a small part of the network visualization generated by our tool for the Apoptosis pathway. The generated network contains 60 interactions between 40 pathways, representing nested pathways in Reactome, and 87 proteins involved in the inner pathways of Apoptosis. Here we show the interaction between pathway336 and pathway335, which are the caspase-mediated cleavage of cytoskeletal proteins and the apoptotic cleavage of cellular proteins pathways, respectively. These two pathways are part of the outer pathways Apoptotic execution phase and Apoptosis, which are not shown here. The number of identified proteins in pathway336 is 8, while the number of proteins reported for the same pathway in the Reactome database is 32. The reason for this difference is that some of the proteins reported in Reactome point to the same unique protein identifier. As an example, protein P08670, Vimentin, is mentioned 7 times. Likewise, Q151149 and the rest of the identified proteins are reported 3 times. Our algorithm was not able to identify 3 proteins (caspases 3, 6 and 7) in the list of 32 proteins reported in the Reactome database, due to the incompleteness of the original data downloaded from the Reactome webpage.

Fig. 3: Directed graph generated by the network visualizer. The graph shows the interactions between and within two pathways. Pathways and proteins are shown with their unique IDs. Each edge represents the connection between a pair of source and target nodes. Dark blue: pathways, light blue: proteins, orange: catalysis.

Of great interest in pathway analysis is the identification of protein hubs. Protein hubs are proteins with a high degree of connectivity, which are more likely to be essential in the cell. An example of such a protein is shown in Fig. 4. Protein Q14790 (caspase 8) appears to be involved in the following pathways: FasL/CD95L signaling (pathway309), TNF signaling (pathway310), TRAIL signaling (pathway311), Formation of caspase 8 (pathway312), Activation of pro-caspase 8 (pathway313) and Apoptotic execution (pathway334). Knowing the protein ID or name, and assuming the protein of interest is involved in different pathways, we are able to retrieve the same information from the Reactome search tool; however, it does not give us the intuitiveness of the visualization. Querying the same protein, caspase 8, in Reactome returns more hits than the number of pathways discussed here, since we limited the search to the Apoptosis pathway rather than all the pathways that exist in Reactome.

Fig. 4: Protein hub connecting six inner pathways in the Apoptosis pathway.

V. CONCLUSION

In this work we were able to extract the PPIs associated with any given pathway. Our visualization provides a better representation of the elements involved in a pathway, since it is capable of retrieving and representing data while conserving the hierarchy in which the data was originally represented. Our aim was to highlight the PPIs in the pathways; hence we represented only pathways and proteins at the deepest level of each pathway step of an outer pathway. However, the data retrieved from the triple store by the Aggregator contains more information about each pathway than only its components (e.g. the pathway name), and with the current structure of our tool it is possible to add an extra layer of data to the Network Generator and create a visual representation of the extended network including, e.g., protein complexes or types of interactions, which would make the system more informative. Our tool is compatible with Biopax level 2, so it may not generate the expected result when it is provided with a data file in Biopax level 3.

Moreover, during the course of this work we have observed and analyzed the Biopax format in detail. Some of the classes and properties introduced in Biopax appear unnecessary and also raise the level of complexity in pathway representation and pathway analysis. Some of these complexity issues have been addressed and improved in later releases of Biopax, but pathways represented in Biopax level 2 suffer from this unnecessary complexity. In this work we tried to diminish the amount of redundant data by omitting the biochemical reaction and the left and right steps in each pathway step, and by showing only the proteins involved in a single pathway at the innermost level.

VI. FUTURE WORK

Future work will include the integration of pathways and interactions from other databases like BioGRID [4], MINT [5] and HPRD [6], and the expansion of the query and visualization in such a way that two or more pathways from different sources can be queried and the common interactions highlighted. Furthermore, identified interactions will be ranked based on their number of occurrences in the databases and the literature.

ACKNOWLEDGMENT

This work has been funded by the Program for Research in Third Level Institutions (PRTLI) Cycle 5, which is co-funded by the European Regional Development Fund (ERDF).

REFERENCES

[1] M. Hucka, A. Finney, H. M. Sauro, H. Bolouri, J. C. Doyle, H. Kitano, A. P. Arkin, B. J. Bornstein, D. Bray, A. Cornish-Bowden, A. A. Cuellar, S. Dronov, E. D. Gilles, M. Ginkel, V. Gor, I. I. Goryanin, W. J. Hedley, T. C. Hodgman, and J. Hofmeyr, The Systems Biology Markup Language (SBML): A medium for representation and exchange of biochemical network models, Bioinformatics, vol. 19, pp. 524–531, 2003
[2] E. Demir, M. Cary, S. Paley, K. Fukuda, C. Lemer, I. Vastrik, G. Wu, P. D'Eustachio, C. Schaefer, J. Luciano, F. Schacherer, I. Martinez-Flores, Z. Hu, V. Jimenez-Jacinto, G. Joshi-Tope, and K. Kumaran, The BioPAX community standard for pathway data sharing, Nature Biotechnology, vol. 28, pp. 935–942, 2010
[3] E. G. Cerami, B. E. Gross, E. Demir, I. Rodchenkov, Ö. Babur, N. Anwar, N. Schultz, G. D. Bader, and C. Sander, Pathway Commons, a web resource for biological pathway data, Nucleic Acids Res., 2010
[4] C. Stark, B. J. Breitkreutz, T. Reguly, L. Boucher, A. Breitkreutz, and M. Tyers, BioGRID: a general repository for interaction datasets, Nucleic Acids Res., no. 1, pp. 535–9, 2006
[5] A. Ceol, A. A. Chatr, L. Licata, D. Peluso, L. Briganti, L. Perfetto, L. Castagnoli, and G. Cesareni, MINT, the molecular interaction database: 2009 update, Nucleic Acids Res., vol. 38, Database, 2010
[6] T. S. K. Prasad, R. Goel, K. Kandasamy, S. Keerthikumar, S. Kumar, S. Mathivanan, D. Telikicherla, R. Raju, B. Shafreen, A. Venugopal, L. Balakrishnan, A. Marimuthu, S. Banerjee, D. S. Somanathan, A. Sebastian, S. Rani, S. Ray, and C. J. H. Kishore, Human Protein Reference Database - 2009 Update, Nucleic Acids Res., no. 37, 2009
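
ILLUSTRATIVE SKETCH

The flattening of nested pathways and the identification of protein hubs described in the Methods and Results sections can be illustrated with a minimal Python sketch. This is not the PHP/ARC2 implementation used in this work. It assumes the pathway hierarchy and the sequence participants have already been retrieved from the triple store into plain dictionaries. The pathway names (pathwayA to pathwayD), the participant names (sp1 to sp6), the function names and the assignment of proteins to pathways are all hypothetical; the UniProt IDs are borrowed from the text only as sample values.

```python
from collections import defaultdict

# Hypothetical toy data standing in for results retrieved from the triple store.
# Outer pathway -> inner pathways (pathway steps).
SUB_PATHWAYS = {
    "pathwayA": ["pathwayB", "pathwayC"],
    "pathwayB": ["pathwayD"],
}
# Pathway -> (sequence participant, UniProt ID or None) pairs. Several
# participants may point to the same protein, and some to no protein at all.
PARTICIPANTS = {
    "pathwayC": [("sp1", "P08670"), ("sp2", "P08670"), ("sp3", "Q14790")],
    "pathwayD": [("sp4", "Q14790"), ("sp5", None), ("sp6", "Q13813")],
}


def build_edges(root):
    """Walk the pathway tree from root and return two sets of directed edges:
    (outer pathway, inner pathway) and (pathway, UniProt ID)."""
    pathway_edges, protein_edges = set(), set()
    stack, seen = [root], set()
    while stack:
        pathway = stack.pop()
        if pathway in seen:
            continue
        seen.add(pathway)
        for child in SUB_PATHWAYS.get(pathway, []):
            pathway_edges.add((pathway, child))
            stack.append(child)
        for _participant, uniprot_id in PARTICIPANTS.get(pathway, []):
            # Using a set collapses participants that share a UniProt ID;
            # participants without a protein reference are skipped.
            if uniprot_id is not None:
                protein_edges.add((pathway, uniprot_id))
    return pathway_edges, protein_edges


def protein_hubs(protein_edges, min_pathways=2):
    """Return proteins linked to at least min_pathways distinct pathways."""
    pathways_of = defaultdict(set)
    for pathway, uniprot_id in protein_edges:
        pathways_of[uniprot_id].add(pathway)
    return {
        protein: sorted(pathways)
        for protein, pathways in pathways_of.items()
        if len(pathways) >= min_pathways
    }


pathway_edges, protein_edges = build_edges("pathwayA")
hubs = protein_hubs(protein_edges)
print(sorted(pathway_edges))
print(sorted(protein_edges))
print(hubs)
```

Storing edges in sets is the design choice that removes the redundancy discussed above: sequence participants that resolve to the same UniProt ID within a pathway yield a single pathway-to-protein edge. The hub threshold of two pathways is an arbitrary value chosen for the toy data.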