An Analysis of Differences In Biological Pathway Resources

An Analysis of Differences In Biological Pathway Resources LucyLWang lucylw@uw.edu Department of Biomedical Informatics and Medical Education University of Washington Seattle

98195 Washington

JohnHGennari gennari@uw.edu Department of Biomedical Informatics and Medical Education University of Washington Seattle

98195 Washington

NeilFAbernethy neila@uw.edu Department of Biomedical Informatics and Medical Education University of Washington Seattle

98195 Washington

An Analysis of Differences In Biological Pathway Resources 347D716F360271A4C0BE3651A391F03C GROBID - A machine learning software for extracting information from scholarly documents pathway database knowledge representation resource comparison

Integrating content from multiple biological pathway resources is necessary to fully exploit pathway knowledge for the benefit of biology and medicine. Differences in content, representation, coverage, and more occur between databases, and are challenges to resource merging. We introduce a typology of representational differences between pathway resources and give examples using several databases: BioCyc, KEGG, PANTHER pathways, and Reactome. We also detect and quantify annotation mismatches between HumanCyc and Reactome. The typology of mismatches can be used to guide entity and relationship alignment between these databases, helping us identify and understand deficiencies in our knowledge, and allowing the research community to derive greater benefit from the existing pathway data.

I. INTRODUCTION

Describing and studying biological pathways is necessary for understanding biological and disease processes. Biological functions and processes follow from complex networks of interactions among gene products and molecules. Through the study of pathways of known biochemical reactions, we can gain deeper insights into these interactions. Many of these relationships and reactions have been catalogued in pathway resources such as Reactome, BioCyc, KEGG, and others [1][2][3][4][5].

As of April 2016, PathGuide, a pathway resource aggregator, lists 547 pathway resources [6], each providing specialized knowledge in niche areas of biology. Efforts have been made to integrate some of these databases. PathwayCommons catalogs human pathway resources under a unified biological pathway exchange umbrella (BioPAX), allowing easier querying of pathways across 22 different resources [7,8]. Tools such as Consensus Pathway DB [9] and hiPATHDB [10] offer querying and visualization of pathways from multiple databases. Statistical frameworks like R Spider seek to proba-This work was supported by the National Library of Medicine (NLM) Training Grant T15LM007442.

bilistically combine protein interactions from various pathway databases into merged networks [11]. These tools improve querying of multiple resource, and pave the way towards more comprehensive network models of human biological processes. Some work has also been done in inter-resource comparison, quantifying the overlap between different databases [12][13][14][15]. These comparison studies emphasize differences in entity membership in pathways and differing counts of unique entities and pathways, but do not focus on cross-resource entity alignment. Existing tools for entity normalization of proteins [16] and metabolites [17] may provide a starting point for alignment. Other studies emphasize aligning metabolic pathways of different species in order to find analogous but missing relationships [18,19], merging resources for combined network analysis [11,20], or defining conserved pathway elements across existing pathway resources [21]. However, although they represent progress, the tools and studies mentioned above accomplish goals that do not include aligning representations across resources.

Given the number and uniqueness of pathway resources, inter-resource merging is a challenge. In order to successfully align and integrate the content of multiple knowledge bases, we must contend with variability in content correctness, standards usage, knowledge representation choices, and coverage. Pathway data sharing standards such as BioPAX, SBML, and PSI MI [8,22,23] assist in the interchange of pathway resources, but even resources available in the same standard still retain differences in content and representation. Nonetheless, our goal is to align knowledge, so that users can benefit from a semantic union across multiple resources.

To align resources, we must comprehensively understand the types of differences one may encounter. Stobbe et al. have made an excellent start in this direction, providing numerous examples and descriptions of the sorts of differences among metabolic pathway resources [13,24]. Here, we extend this work, aiming at a typology We also present some results in quantifying annotation mismatches between two popular human pathway resources: HumanCyc and Reactome. Results demonstrate the pervasiveness of representational differences and suggest further work towards consensus pathway representations. Understanding the types of mismatches that exist between resources is a first step towards expanding and deriving the full benefit of our pathway knowledge.

II. MISMATCHES IN PATHWAY RESOURCES: A TYPOLOGY

To provide examples of mismatches, we retrieved reaction representations from HumanCyc, KEGG, PAN-THER, and Reactome. Fig. 1 shows several different representations of a step of glycolysis in Homo sapiens: the conversion of phosphoenolpyruvate and ADP to pyruvate and ATP modulated by the enzyme pyruvate kinase. In this single, well-studied biochemical reaction, we see a variety of important mismatches, of which a subset are described below. *

A. Annotation

We first consider annotation mismatches on the participating physical entities. Inconsistencies arise when two pathway resources refer to the same entity with different identifiers or different names. Pyruvate is represented by all four resources (Fig. 1), but is annotated with two identifiers, ChEBI:15361 (HumanCyc and PANTHER) and ChEBI:32816 (KEGG and Reactome). The ChEBI:15361 entity "pyruvate" and ChEBI:32816 entity "pyruvic acid" are conjugate acids and bases of one another in ChEBI. The display name for pyruvate also differs between resources, and is given as "pyruvate" (HumanCyc), "Pyruvate" (KEGG, PANTHER), or "PYR" (Reactome). Differences in identifiers and names are also seen for all other participants in this reaction.

In order to resolve these mismatches, we must either enforce consistent labeling of entities across resources, or somehow infer alignment of similar but differently annotated entities across resources. The former strategy is usually impractical; in this case, we can infer similarity by treating ChEBI identifiers that refer to conjugate acid/base pairs as synonyms.

A second type of annotation mismatch occurs when entities lack cross-referenced identifiers, e.g., no identifiers are given for ADP or ATP in PANTHER pathways. Other features such as string name, entity relationships, and local network topology can be used to align entities between resources when identifiers are insufficient.

B. Existence

Existence refers to missing or extraneous physical entities, reactions, relationships, or information, e.g., entities that participate in a reaction or reactions that are members of a pathway in one resource but not another, or a connection between two reactions that occurs in one resource but not another. In Reactome, for example, the conversion of fructose 6-phosphate to fructose 2,6biphosphate is a reaction in the glycolysis pathway. This reaction is not included in the glycolysis pathway of the other three resources. Although the reaction involves entities that participate in glycolysis, there is uncertainty in whether it is important to the overall process.

Another example of an existence mismatch is the inclusion of H+ in the conversion of phosphoenolpyruvate to pyruvate in HumanCyc (Fig. 1). The ion is included in order to balance reaction charge, but according to BioPAX3 documentation, reaction participants should be neutral and ions such as H+ and Mg2+ are not recommended for inclusion [25]. Other potential existence mismatches could occur if one resource lacks or is missing relevant information about a relationship between two entities, or one resource specifically negates the existence of a relationship asserted in another resource.

Existence mismatches can be resolved by taking the most common representation between many resources (democratic) or by integrating all possible representations (exhaustive). Although an exhaustive consensus method is unlikely to leave out information, it may, however, produce a large and unwieldy alignment.

C. Reaction semantics

Many differences in reaction representation have been described in Stobbe et al, such as using the terms left and right, product and substrate, and input and output to describe participants in reactions [24]. In BioPAX, the properties conversionDirection, stepDirection, left, and right are used to indicate reaction direction, as well as the identities of reactants and products [25]. In KEGG, PANTHER, and Reactome, phosphoenolpyruvate is labeled left and pyruvate right, with a reaction direction of left-to-right. However, in HumanCyc, phosphoenolpyruvate is labeled right and pyruvate left and the reaction direction is right-to-left, a choice dictated by the Enzyme Commission system [2]. Even though HumanCyc is in the minority, its choice follows recommendations from the BioPAX3 specifications [25].

Resolving this type of semantic mismatch between resources requires knowledge about the ordering of reactions, which can be derived from pathway design, or when reactions are taken out of context, may depend on chemical kinetics and the reacting environment. For well-studied pathways, a consensus ordering usually exists. When participant left and right labels differ between resources and ordering is unclear, the BioPAX pathway-Order object (designed to relay reaction topology) can sometimes be used along with reaction direction to infer the correct sequence.

D. Granularity

Mismatches of granularity occur when resources represent the same entity or process in different degrees of detail. One example is complex naming. Many reaction enzymes are complexes made up of multiple protein subunits. A reaction may be annotated with a protein modifier, when in actuality, it is catalyzed by a complex: a dimer, trimer etc. In Fig. 1, Reactome makes this distinction by annotating to the "pyruvate kinase tetramer," a complex composed of the pyruvate kinase protein referenced from the other three resources. Due to the lack of standardized complex naming, however, we often cannot easily align complexes and proteins between resources.

Another type of granularity mismatch occurs at the reaction level. For example, one resource may choose to represent the elementary steps of a reaction, including intermediate chemical species. A single reaction in one resource may be represented as several in another, with the same ultimate inputs and outputs. For example, the oxidative decarboxylation of isocitrate is a two step process, modified by the enzyme isocitrate dehydrogenase, producing α-ketoglutarate from isocitrate via an oxalosuccinate intermediate. The reaction can be represented both with and without the intermediate species, as in Fig. 2. In these cases, we can study the ultimate inputs and outputs of ordered reaction sequences to determine the appropriate reaction-level alignment.

III. ANNOTATION DIFFERENCES BETWEEN TWO

RESOURCES

We identify and enumerate mismatches in entity annotation between two exemplar resources: HumanCyc and Reactome. Compared to other mismatches, a disagreement in the annotation of entities could be viewed as primary: if two resources disagree on physical entities, then they are also likely to disagree on the reactions and pathways in which these entities participate.

The most confident match between entities in two resources arises when both identifiers and names match. For example, the molecule ATP matches both on name and ChEBI identifier for KEGG and Reactome (Fig. 1). Less confident are identifier matches without string name matches (e.g. HumanCyc and KEGG use different names for the entity cross-referenced to UniProt:P30613), and string name matches without identifier matches (e.g. HumanCyc and PANTHER cross-reference to different ChEBI identifiers for the entity named "phosphoenolpyruvate").

From HumanCyc and Reactome, we extract all proteins and small molecules with cross-referenced identifiers (UniProt for proteins, ChEBI for small molecules) and names. String names are taken as all objects of the BioPAX properties name, displayName, and stan-dardName on the entity of interest. Using only string names and UniProt/ChEBI identifiers, there are four possible ways that entities can match between these two resources. Entities in HumanCyc can match to entities in Reactome on ID and name (+I/+N), ID but not name (+I/-N), name but not ID (-I/+N), and on neither ID nor name (-I/-N). For this initial analysis, we define string name matches as case-insensitive equivalence, so small differences in spelling do not produce a match.

For each entity in HumanCyc, SPARQL queries are used to determine whether a matching entity exists in Reactome, and similarly, Reactome entities are matched I. Out of 2737 unique HumanCyc proteins, 2078 (75.9%) match to Reactome entities using identifiers and/or string names. Out of 16949 unique Reactome proteins, 2973 (17.5%) match to a HumanCyc protein on identifiers and/or name. Reactome references many protein isoforms, causing the large imbalance in unique protein counts between the two resources. These match ratios are illustrated in Fig. 3.

Table II shows matches for small molecules. In Hu-manCyc, 866 (53.8%) out of 1610 small molecules match on annotation to an entity in Reactome. In Reactome, 1591 (55.0%) out of 2891 small molecules match to entities in HumanCyc, with a large proportion (890 out of 1591) matching on string names only.

Cross-referenced identifiers are the gold standard of matching between two resources. Therefore, groups +I/+N and +I/-N likely consist of true matches. Group -I/+N can be used to learn about representational differences. Some of the cross-references for entities in this group point to secondary accession identifiers, which redirect to other identifiers in the same database. For example, UniProt:A0AVP9 redirects to UniProt:Q8IWU4, the entity Zinc transporter 8. For small molecules only, we also find annotation to ChEBI conjugate acids or bases (e.g., HumanCyc annotates with ChEBI:456216 (ATP(3-)), a conjugate base of ChEBI:16761 (ATP), which is used in Reactome), or annotation to tautomers (e.g., ChEBI: 16828 and ChEBI:57912 for L-tryptophan and the L-tryptophan zwitterion respectively). Annotation mismatches of the above subtypes are detected by querying the UniProt or ChEBI APIs using the BioServices 1.4.8 Python package [26].

Within -I/+N matches, the 55 HumanCyc and 88 Reactome proteins had 208 pairwise string name matches. Of these, 28 pairs had cross-referenced identifiers that are UniProt secondary accession IDs, indicating that they likely refer to the same entity. We could not confirm the identities of the other 180 pairs through UniProt accession identifiers. For small molecules in the -I/+N group, the 479 HumanCyc and 890 Reactome molecules had 1869 pairwise string name matches. Of these, at least 1506 pairs referred to similar entities. Annotation to ChEBI conjugate acids or bases accounted for the majority of these (1122), followed by annotation with ChEBI tautomer IDs (301), and ChEBI secondary accession numbers (83).

IV. DISCUSSION

In order to reduce redundancy and errors when merging information from different knowledge bases, we must correctly align entities and other assertions between resources. Entity alignment is a necessary first step before we can clarify higher-order concepts such as complexes, reactions, and pathways. As demonstrated using HumanCyc and Reactome, many proteins and small molecules can be matched between two resources using annotation features such as cross-referenced identifiers and string names. Among entities that share only string names, many have related identifiers that can be matched computationally. Related identifiers can be used to help improve the accuracy of annotations.

Moving beyond annotation, other issues of semantics and granularity come into play. For future work, we intend to incorporate other features, such as entity relationships and graph properties like degree and bipartite connectivity to assist in entity alignment.

Several limitations exist in this work. First, we only compared entities between two pathway resources, Hu-manCyc and Reactome. We expect to expand our analysis to include other resources as well. Although some of our current methods rely on BioPAX, our general ideas about physical entities and their annotations can be applied to data represented using other biological pathway knowledge standards.

Another limitation arises in the way we identify annotation mismatches. We only assessed proteins and small molecules with UniProt or ChEBI identifiers, excluding those entities without cross-references or with crossreferences to other databases. This was partially for simplicity and partially to limit the size of the comparison problem. For example, an agreement on one set of identifiers and a disagreement on another yields yet another class of mismatches.

Lastly, we were limited by our use of 100% string name matching to identify potential matched entities. By doing so, we limit our ability to detect positive matches and yield more conservative results, e.g., "fructose 1,6 bisphosphate" does not match to "D-fructose 1,6-bisphosphate"; the second is a stereoisomer of the first (generic) molecule, and they may play similar roles in reactions. Fuzzy string matches may perform better. However, we want to minimize the false positive rate, e.g., "fructose 1,6-bisphosphate" and "fructose 2,6bisphosphate" only differ by one character but refer to different molecules. With these caveats, the typology we present affords an opportunity to test different algorithms for the systematic alignment of pathway resources.

V. CONCLUSION

The complexity of pathway content is a barrier to resource integration, but as described above, we are also challenged by representational and content differences. Standards like BioPAX help clarify some differences between resources, but they do not solve all problems of interoperability. In order to draw from the spectrum of knowledge we have built as a community, the content of these resources must be aligned and integrated into something greater than the parts. Doing so involves identifying the differences between resources, and resolving those differences to understand shared meaning. Our results show that a sizable portion of physical entities can be aligned between pathway resources using existing cross-referenced identifiers and string names. However, annotation features alone are likely insufficient for matching a majority of entities between resources. Knowledge of entity relationships, reaction semantics, granularity, and more about these resources is necessary to create and evaluate potential alignments. Much of the work can be done computationally, and the typology above should guide the engineering of future matching algorithms.

To align and integrate knowledge across resources, the research community must have strategies for resolving these different sorts of mismatches. Some mismatches, such as those of annotation, can largely be resolved using the existing data. Other issues of semantics, such as differences in how standard languages are used to express the same knowledge, pose a bigger challenge. Resource developers should be allowed to make different choices in knowledge representation. However, this flexibility should not come at the cost of increased error or decreased interoperability. A better understanding of how specific mismatches occur will provide an incentive for resources to work toward interoperable data and representations.

Fig. 1 :1Fig. 1: The conversion of phosphoenolpyruvate and ADP into pyruvate and ATP assisted by the enzyme pyruvate kinase, as represented by HumanCyc, KEGG, PANTHER, and Reactome. The display name for each entity is given, along with ChEBI or UniProt identifiers where available. Entities related to the reaction by the BioPAX left property are red, and entities related by the BioPAX right property are green.

Fig. 2 :2Fig.2:The oxidative decarboxylation of isocitrate can be represented as a two-step process with an oxalosuccinate intermediary (left) and as a one-step process (right).

Fig. 3 :3Fig. 3: Protein (top) and small molecule (bottom) matches between Hu-manCyc and Reactome based on annotation, consisting of unmatched HumanCyc entities (red), unmatched Reactome entities (blue), and matched entities between the two resources (purple).

TABLE I :IProteins matches between HumanCyc and Reactome on UniProt identifiers and namesHumanCyc protein matches to Reactome+N-NTotal+I1264 759 2023-I55659714Total 1319 1418 2737Reactome protein matches to HumanCyc+N-NTotal+I1495 13902885-I8813976 14064Total 1583 15366 16949

TABLE II :IISmall molecule matches between HumanCyc and Reactome on ChEBI identifiers and namesHumanCyc small molecule matches to Reactome+N-N Total+I247 140 387-I479 744 1223Total 726 884 1610Reactome small molecule matches to HumanCyc+N-NTotal+I425276701-I890 1300 2190Total 1315 1576 2891to HumanCyc entities. Resulting matches for proteinsare given in Table

ACKNOWLEDGEMENTS

The authors thank Peter Karp for helpful comments on an early draft of this paper.

* Pathways were retrieved from Reactome v55 (http://reactome. org) and HumanCyc v19.5 (http://humancyc.org) BioPAX3 exports and through PathwayCommons v7 [7]. Glycolysis pathways for KEGG and PANTHER are located at http://purl. org/pc2/7/#Pathway 307add3cea6530288cc1016267ec055b and http: //identifiers.org/panther.pathway/P00024 respectively and are supplemented by the pathway diagrams at http://kegg.jp and http:// pantherdb.org.

The reactome pathway knowledgebase DCroft AMundo RHaw Nucleic Acids Res 42 2013 Database issue Computational prediction of human metabolic pathways from the complete human genome PRomero JWagg MGreen DKaiser MKrummenacker PKarp Genome Biology 6 R2 2004 Kegg: Kyoto encyclopedia of genes and genomes MKanehisa SGoto Nucleic Acids Res 28 2000 Panther: a library of protein families and subfamilies indexed by function PThomas MCampbell A Genome Res 13 2003 Wikipathways: capturing the full diversity of pathway knowledge MKutmon ARiutta NNunes Nucleic Acids Res 44 D1 2016 Pathguide: a pathway resource list GBader MCary CSander Nucleic Acids Res 34 2005 Database issue Pathway commons, a web resource for biological pathway data ECerami BGross EDemir Nucleic Acids Res 39 2011 Database issue The biopax community standard for pathway data sharing EDemir MCary SPaley Nature Biotechnology 28 9 2010 Consensuspathdb -a database for integrating human functional interaction networks AKamburov CWierling HLehrach RHerwig Nucleic Acids Res 37 2009 Database issue hipathdb: a human-integrated pathway database with facile visualization NYu JSeo KRho Nucleic Acids Res 40 2012 Database issue R spider: a network-based analysis of gene lists by combining signaling and metabolic pathways from reactome and kegg databases AAntonov ESchmidt SDietmann MKrestyaninova HHermjakob Nucleic Acids Res 38 2010 Web Server issue Consistency, comprehensiveness, and compatibility of pathway databases DSoh DDong YGuo LWong BMC Bioinformatics 11 2010 Critical assessment of human metabolic pathway databases: a stepping stone for future integration MStobbe SHouten GJansen AVan Kampen PMoerland BMC Systems Biology 5 2011 A systematic comparison of the metacyc and kegg pathway databases TAltman MTravers AKothari RCaspi PKarp BMC Bioinformatics 14 112 2013 Comparison of human cell signaling pathway databases -evolution, drawbacks and challenges SChowdhury RSarkar Database 126 2015 Integrating various resources for gene name normalization YHu YLi HLin ZYang LCheng PLoS One 7 9 e43558 2012 The chemical translation service -a web-based tool to improve standardization of metabolomic reports GWholgemuth PHaldiya EWillighagen TKind OFiehn Bioinformatics 26 20 2010 Submap: aligning metabolic pathways with subnetwork mappings FAy TKahveci Journal of Computational Biology 18 3 2011 Mp-align: alignment of metabolic pathways RAlberich MLlabrés DSánchez MSimeoni MTuduri BMC Systems Biology 8 58 2014 Using random walks to identify cancer-associated modules in expression data DPetrochilos AShojaie JGennari NAbernethy BioData Mining 6 1 17 2013 Unipathway: a resource for the exploration and annotation of metabolic pathways AMorgat ECoissac ECoudert KAxelsen GKeller ABairoch Nucleic Acids Research 40 2012 Database issue The systems biology markup language (sbml): a medium for representation and exchange of biochemical network models MHucka AFinney HSauro Bioinformatics 19 4 2003 The hupo psi's molecular interaction format -a community standard for the representation of protein interaction data HHermjakob LMontecchi-Palazzi GBader Nature Biotechnology 22 2004 Knowledge representation in metabolic pathway databases MStobbe GJansen PMoerland AVan Kampen Brief Bioinform 15 3 2014 May Biopax -biological pathways exchange language, level 3 July 2010 release version 1 documentation Bioservices: a common python package to access biological web services programmatically TCokelaer DPultz LHarder JSerra-Musach JSaez-Rodriguez Bioinformatics 29 24 2013