Looking into Reactome through Biopax Lens
Laleh Kazemzadeh∗, Helena Deus∗, Michel Dumontier† and Frank Barry‡
∗ Digital Enterprise Research Institute
National University of Ireland, Galway
Email: laleh.kazemzadeh@deri.org
† Department of Biology
Ottawa Institute of Systems Biology
Ottawa, Canada
Email: michel_dumontier@carleton.ca
‡ National Centre for Biomedical Engineering Science
National University of Ireland, Galway
Email: frank.barry@nuigalway.ie


Abstract: To understand cell behavior under different conditions, the computational simulation of biological pathways is of great interest. Simulating a biological pathway computationally requires extensive knowledge of the protein-protein interactions (PPIs) in the pathway, along with information about the generic flow of the pathway components, i.e. the biological reactions that make up the pathway.

The popularity of Semantic Web technologies for tackling integrative bioinformatics challenges has increased, with various approaches used to aggregate and correlate data from different sources. However, the integration of publicly available pathway databases, to determine the different PPIs and hence effectively simulate cell behavior, still faces various obstacles. In this paper, we present a semantic approach to the pathway-wise analysis of PPIs using Biopax standards, focusing particularly on the Reactome database. We identify the PPIs involved in a given pathway by the hierarchical extraction of its components (complexes, proteins, small molecules). We have developed a visualization tool that automatically generates a visual representation of the directed graph of PPIs in any specified pathway. Our approach provides intuitive inference of the data by flattening the nested pathways in Reactome and their components instead of wrapping each layer of data in the shell of an outer pathway. We also discuss how the representation of a pathway in the Biopax standard format is highly complex and even contains redundant information. Hence tools are needed to facilitate the navigation and analysis of pathway datasets structured in Biopax format.

I. INTRODUCTION

The functionality of the human body is tightly regulated by biological pathways. The basic building blocks of these pathways are proteins, which act in concert to keep the regulation of pathways intact. Therefore, understanding the dynamics of these pathways depends directly on understanding how the proteins involved in a pathway interact with each other. The interaction between two proteins can be of different types, e.g. activation, inhibition, and methylation. Analyzing biological data from a pathway perspective can yield valuable information about the process of disease and suggest new drug discovery methods that target mis-regulation in specific pathways, thus enabling a much more precise targeting of diseases. However, computationally representing a pathway is not a trivial exercise due to the various types of components and interactions; regulation of pathways requires a cascade of events and interactions between genes, proteins and small molecules.

In addition, there is significant cross-talk between pathways, which highlights the fact that pathways are not isolated but are made up of a network of components. As such, treating them as a system, as opposed to an enclosed and self-contained pathway, can support a more realistic investigation.

II. STATE OF THE ART

A large number of tools and applications, vocabularies and ontologies aimed at computationally modeling biological pathways currently exist with enough precision to enable realistic simulations of their processes and determination of the mechanism of action of various molecular compounds; examples include the Systems Biology Markup Language (SBML) [1] and the Proteomics Standards Initiative-Molecular Interaction (PSI-MI)¹. These models and data formats are also devised to deepen and broaden our understanding of pathways. A few models also keep track of semantics, i.e. they attempt to precisely and unambiguously describe each compound and each interaction such that they can be interpreted by applications and thus be integrated with other models. Biological Pathway Exchange (Biopax) [2] is one such data format. Biopax is a standard format for representing pathways and molecular interactions within and between pathways, which has been developed with the aim of facilitating the process of collecting, indexing and sharing data [2]. Several databases hosting pathway and protein interaction information, such as Reactome² and Pathway Commons [3], are already available in this format. Information retrieved from expert-curated databases like Reactome is highly valuable for scientific advancement since it provides the most accurate training data sets. However, because such databases rely on human curation, they suffer from limited coverage in the number of interactions available. Integrating such data

¹ http://www.psidev.info
² http://www.reactome.org
[Figure 1 (diagram): Pathway336 is linked to biochemicalReaction988 through biochemicalReaction997. For the expanded reaction (labelled 989), sequenceParticipant3539 to sequenceParticipant3541 point to P36611, sequenceParticipant3542 to sequenceParticipant3544 point to Q13813, and sequenceParticipant3568 to sequenceParticipant3570 point to Null.]


Fig. 1: Example of redundancy and incompleteness of data represented in Biopax level 2, taken from the caspase-mediated cleavage of cytoskeletal proteins pathway. The blue box indicates the sample pathway, orange boxes represent the list of biochemical reactions associated with this pathway, green boxes show the sequence participants on the left and right of each biochemical reaction, and red boxes depict the unique UniProt ID of the protein that each left and right side of a biochemical reaction points to.
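
The redundancy shown in Fig. 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation (their Aggregator is written in PHP); the participant-to-protein grouping is read off the figure layout and should be treated as an assumption, and `collapse_participants` is a hypothetical helper name. `None` stands for the "Null" entries that carry no identifier.

```python
from collections import defaultdict

# Sequence participant -> UniProt ID, as read off Fig. 1.
# The grouping is an assumption based on the figure layout;
# None stands for the "Null" entries without an identifier.
participant_to_uniprot = {
    "sequenceParticipant3539": "P36611",
    "sequenceParticipant3540": "P36611",
    "sequenceParticipant3541": "P36611",
    "sequenceParticipant3542": "Q13813",
    "sequenceParticipant3543": "Q13813",
    "sequenceParticipant3544": "Q13813",
    "sequenceParticipant3568": None,
    "sequenceParticipant3569": None,
    "sequenceParticipant3570": None,
}


def collapse_participants(mapping):
    """Group sequence participants by the UniProt ID they point to.

    Returns (by_protein, unresolved): a dict from UniProt ID to the list
    of participants that reference it, and the list of participants
    that have no identifier at all.
    """
    by_protein = defaultdict(list)
    unresolved = []
    for participant, uniprot_id in mapping.items():
        if uniprot_id is None:
            unresolved.append(participant)
        else:
            by_protein[uniprot_id].append(participant)
    return dict(by_protein), unresolved


proteins, unresolved = collapse_participants(participant_to_uniprot)
print(len(participant_to_uniprot), "participants,", len(proteins), "unique proteins")
print("participants without a UniProt ID:", unresolved)
```

Keeping the unresolved participants in a separate list, rather than dropping them, makes the incompleteness of the source data visible alongside its redundancy.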


warehouses in one standard format will improve the coverage and highlight the role of Biopax in standardization. There is enormous potential in using the information represented in Biopax format to realistically address biological questions, for example, the metabolic effects of a compound in the cell or how certain alterations in the metabolic network can be at the root cause of diseases or drug resistance. The discovery and confirmation of a biologically meaningful molecular interaction often requires the analysis of an enormous amount of heterogeneous data, which are typically deposited in local databases and isolated from each other. Therefore, a considerable number of molecular interactions are "hidden" in this data and can only be exposed once these results are integrated and the recurrence of patterns indicative of interactions is analyzed. The data integration challenges in life science have motivated researchers to adopt the new integration technologies offered by the Semantic Web and Linked Data. Semantic Web technologies can provide a bridge between the datasets, enabling the discovery of links that are often not obvious. These bridges are often standard vocabularies and ontologies developed toward improvements in knowledge discovery, which leads to the next challenge: the representation, application and acceptance of these standard vocabularies by the domain experts. The motivational scenario for the work presented here is the extraction of all the molecular components that act in a particular biological process as described by Biopax in its various data sources. We have chosen Biopax firstly because it has been adopted by several databases that provide information on signalling pathways, and secondly because it facilitates data integration from other sources containing protein information.

Biopax has been developed to capture various aspects of signalling, regulatory and metabolic pathways. However, in order to provide a descriptive solution and to cover all details in the description of pathways, some complexity needed to be introduced. In Biopax each pathway is constructed in the form of nested pathways, which partially, but not fully, illustrate the overlaps between several pathways. Furthermore, each biochemical reaction is described as a function of the "left" and "right" hand sides of the stoichiometric equation. Fig. 1 illustrates an example of data complexity and redundancy in representing the biochemical reactions involved in pathway336 (caspase-mediated cleavage of cytoskeletal proteins). As mentioned above, each biochemical reaction has left and right components, each of which refers to a unique and separate sequence participant. However, each of these sequence participants points to the same protein ID from the UniProt database. In other words, both the left and the right of a given biochemical reaction point to the same protein, and this increases the redundancy of the data. The aim of our work is to devise a tool that aggregates information from this data, e.g. the protein interactions and the components of protein complexes in pathways. This will allow us to easily identify common interactions between various components (proteins, complexes, etc.) across pathways, abstracting from the complexity of pathway representation in Biopax. The data analysis tools made available by Reactome are unable to provide this inner-pathway analysis unless the pathways are nested or siblings.

III. METHODS

One typical way of querying a pathway or an interaction between two proteins from different online databases is through browsing their webpages. As easy as it seems, it is time consuming and cumbersome to go through all the available databases manually. Instead, we can query the PPIs directly from the raw data provided by databases like Reactome and other such pathway databases. We propose an approach to overcome such problems, which is explained below.

Fig. 2 shows an overall view of the steps taken in our approach to identify the protein-protein interactions pathway-wise. We downloaded the protein-protein interaction file for Homo sapiens from the Reactome webpage in Biopax format. This data was uploaded to our Sesame
                                                   Fig. 2: Overall view of the proposed method.
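
The flow in Fig. 2 (select a pathway, list its inner pathway steps, extract their components, build a directed graph) can be sketched in Python under stated assumptions. This is not the authors' PHP/ARC2 code: the dictionaries `PATHWAY_STEPS` and `PATHWAY_PROTEINS` and the function `build_edges` are hypothetical stand-ins for what would be retrieved from the triple store, and the nesting of pathway334, pathway335 and pathway336 is assumed for illustration only.

```python
# Toy stand-ins for the data that would come from the triple store.
# The nesting below is an illustrative assumption, not Reactome data.
PATHWAY_STEPS = {
    "pathway334": ["pathway335"],
    "pathway335": ["pathway336"],
    "pathway336": [],
}
PATHWAY_PROTEINS = {
    "pathway334": ["Q14790"],
    "pathway335": [],
    # Repeated on purpose: the raw data lists the same protein many times.
    "pathway336": ["P08670", "P08670"],
}


def build_edges(root, steps, proteins):
    """Walk the pathway tree from root and return directed edges.

    Each edge is a (source, target) pair going from a pathway to one of
    its inner pathways or to one of its proteins. Repeated protein
    entries within a pathway are collapsed to a single edge.
    """
    edges = []
    seen = set()
    stack = [root]
    while stack:
        pathway = stack.pop()
        if pathway in seen:  # a sub-pathway may be shared by several parents
            continue
        seen.add(pathway)
        for protein in sorted(set(proteins.get(pathway, []))):
            edges.append((pathway, protein))
        for child in steps.get(pathway, []):
            edges.append((pathway, child))
            stack.append(child)
    return edges


edges = build_edges("pathway334", PATHWAY_STEPS, PATHWAY_PROTEINS)
for source, target in edges:
    print(source, "->", target)
```

An edge list of this kind is the usual input shape for a force-directed layout, where each distinct source or target becomes a node.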


server³ in the form of triples. The Aggregator module has been developed to extract the components involved in a pathway and break the pathway down to the level of complexes, proteins and molecules.

The system provides a list of selectable pathways compatible with the pathway names used in Reactome. The ID of the selected pathway, e.g. Apoptosis or Programmed Cell Death (PCD), is retrieved from the triple store by the ID Retrieval module. The Pathway Step Retrieval module retrieves the list of inner pathways (pathway steps) forming the selected pathway. Each of these pathways is segregated hierarchically in the Extraction module.

The data extracted from each pathway step contains a bundle of relational information describing reactions, complex blocks, proteins and the small molecules forming complexes. In the final stage, the Network Generator constructs a model from the data extracted in the previous step. This model is then fed to the network visualizer, which renders and displays the relational graph between the components of the pathway. In this model, the relation between each entity, complex, protein and molecule in the pathway is illustrated in a directed graph where nodes represent the entities, pathways, proteins and molecules, and edges represent the connections between source and target nodes, or between the higher-level and lower-level components in a pathway tree.

The interaction Aggregator is written in PHP using the ARC2⁴ package to query the Reactome triples. The force-directed graph is generated with Data-Driven Documents (d3)⁵, a library written in JavaScript.

IV. RESULTS

The raw material in our approach is an input .owl file, which contains the information of any pathway in Biopax. Applying our method, we were able to generate a pathway-wise PPI network, which is shown and discussed below.

Fig. 3 shows a small part of the network visualization generated by our tool for the Apoptosis pathway. The generated network contains 60 interactions between 40 pathways, representing nested pathways in Reactome, and 87 proteins involved in the inner pathways of Apoptosis. Here we show the interaction between pathway336 and pathway335, which are the caspase-mediated cleavage of cytoskeletal proteins and the apoptotic cleavage of cellular proteins pathways respectively. These two pathways are part of the outer pathways Apoptotic execution phase and Apoptosis, which are not shown here.

The number of identified proteins in pathway336 is 8, while the number of reported proteins for the same pathway in the Reactome database is 32. The reason for this difference is that some of the reported proteins in Reactome point to the same unique protein identifier. As an example, protein P08670, Vimentin, is mentioned 7 times. Likewise, Q151149 and the rest of the identified proteins are reported 3 times. Our algorithm was not able to identify 3 proteins (caspases 3, 6 and 7) in the list of 32 proteins reported in the Reactome database, due to incompleteness of the original data downloaded from the Reactome webpage.

Of great interest in pathway analysis is the identification of protein hubs. Protein hubs are proteins with a high degree of connectivity, which are more likely to be essential in the cell. An example of such a protein is shown in Fig. 4. Protein Q14790 (caspase 8) appears to be involved in the following pathways: FasL/CD95L signaling (pathway309), TNF signaling (pathway310), TRAIL signaling (pathway311), Formation of caspase 8 (pathway312), Activation of pro-caspase 8 (pathway313) and Apoptotic execution (pathway334). Knowing the protein ID or name, and assuming the protein of interest is involved in different pathways, we are able to retrieve the same information from the Reactome search tool; however, it does not give us the intuitiveness of the visualization. Querying the same protein, caspase 8, in Reactome returns more hits than the number of

³ http://hcls.deri.org:8080/openrdf-workbench/repositories/
⁴ https://github.com/semsol/arc2/wiki
⁵ http://d3js.org/
Fig. 3: Directed graph generated by the network visualizer. The graph shows the interactions between and within two pathways. Pathways and proteins are shown with their unique IDs. Each edge represents the connection between a pair of source and target nodes. Dark blue: pathways, light blue: proteins, orange: catalysis.

Fig. 4: Protein hub connecting six inner pathways in the Apoptosis pathway.

pathways we discussed here, since we limited the search only to the Apoptosis pathway and not to all the pathways that exist in Reactome.

V. CONCLUSION

In this work we were able to extract the PPIs associated with any given pathway. Our visualization provides a better representation of the elements involved in a pathway since it is capable of retrieving and representing data while conserving the hierarchy in which the data was originally represented. Our aim was to highlight the PPIs in the pathways, hence we represented only pathways and proteins at the deepest level of each pathway step of an outer pathway. However, the data retrieved from the triple store by the Aggregator contains more information about each pathway than only its components (e.g. the pathway name), and with the current structure of our tool it is possible to add an extra layer of data to the Network Generator and create a visual representation of the extended network including, e.g., protein complexes or types of interaction, which would make the system more informative. Our tool is compatible with Biopax level 2, thus it may not generate the expected result when it is provided with a data file in Biopax level 3. Moreover, during the course of this work we have observed and analyzed the Biopax format in detail. Some of the classes and properties introduced in Biopax appear unnecessary and raise the level of complexity in pathway representation and pathway analysis. Some of these complexity issues have been addressed and improved in a later release of Biopax, but pathways represented in Biopax level 2 suffer from this unnecessary complexity. In this work we tried to diminish the amount of redundant data by omitting the biochemical reaction and the left and right steps in each pathway step, and showing only the proteins involved in a single pathway at the innermost level.

VI. FUTURE WORK

Future work will be the integration of pathways and interactions from other databases like BioGRID [4], MINT [5] and HPRD [6], and the expansion of the query and visualization in such a way that two or more pathways from different sources can be queried and the common interactions highlighted. Furthermore, identified interactions will be ranked based on the number of occurrences in the databases and the literature.

ACKNOWLEDGMENT

This work has been funded by the Program for Research in Third Level Institutions (PRTLI) Cycle 5, which is co-funded by the European Regional Development Fund (ERDF).

REFERENCES

[1] M. Hucka, A. Finney, H. M. Sauro, H. Bolouri, J. C. Doyle, H. Kitano, A. P. Arkin, B. J. Bornstein, D. Bray, A. Cornish-Bowden, A. A. Cuellar, S. Dronov, E. D. Gilles, M. Ginkel, V. Gor, I. I. Goryanin, W. J. Hedley, T. C. Hodgman, and J. Hofmeyr, The Systems Biology Markup Language (SBML): A medium for representation and exchange of biochemical network models, Bioinformatics, vol. 19, pp. 524–531, 2003.
[2] E. Demir, M. Cary, S. Paley, K. Fukuda, C. Lemer, I. Vastrik, G. Wu, P. D'Eustachio, C. Schaefer, J. Luciano, F. Schacherer, I. Martinez-Flores, Z. Hu, V. Jimenez-Jacinto, G. Joshi-Tope, and K. Kumaran, The BioPAX community standard for pathway data sharing, Nature Biotechnology, vol. 28, pp. 935–942, 2010.
[3] E. G. Cerami, B. E. Gross, E. Demir, I. Rodchenkov, Ö. Babur, N. Anwar, N. Schultz, G. D. Bader, and C. Sander, Pathway Commons, a web resource for biological pathway data, Nucleic Acids Res., 2010.
[4] C. Stark, B. J. Breitkreutz, T. Reguly, L. Boucher, A. Breitkreutz, and M. Tyers, BioGRID: a general repository for interaction datasets, Nucleic Acids Res., no. 1, pp. 535–9, 2006.
[5] A. Ceol, A. A. Chatr, L. Licata, D. Peluso, L. Briganti, L. Perfetto, L. Castagnoli, and G. Cesareni, MINT, the molecular interaction database: 2009 update, Nucleic Acids Res., vol. 38, Database, 2010.
[6] T. S. K. Prasad, R. Goel, K. Kandasamy, S. Keerthikumar, S. Kumar, S. Mathivanan, D. Telikicherla, R. Raju, B. Shafreen, A. Venugopal, L. Balakrishnan, A. Marimuthu, S. Banerjee, D. S. Somanathan, A. Sebastian, S. Rani, S. Ray, and C. J. H. Kishore, Human Protein Reference Database - 2009 Update, Nucleic Acids Res., no. 37, 2009.