=Paper= {{Paper |id=Vol-2137/paper_29.pdf |storemode=property |title=Similarity Metrics for Determining Overlap Among Biological Pathways |pdfUrl=https://ceur-ws.org/Vol-2137/paper_29.pdf |volume=Vol-2137 |authors=Lucy L. Wang,John H. Gennari |dblpUrl=https://dblp.org/rec/conf/icbo/WangG17 }} ==Similarity Metrics for Determining Overlap Among Biological Pathways== https://ceur-ws.org/Vol-2137/paper_29.pdf
      Similarity Metrics for Determining Overlap Among Biological
                                 Pathways
                                             Lucy L. Wang ∗, John H. Gennari
       Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, USA




ABSTRACT                                                                 merged based on user needs to generate sets of improved pathways
   Systems biology researchers often rely on the use of one or           for secondary use.
more pathway resources for analysis of gene expression data or              In this paper, we report on our initial efforts at organizing and
experimental results. Unfortunately, there is no single, gold-standard   determining the overlaps of pathways from seven different well-
pathway knowledge resource, nor are there good ways to merge             known resources. Our long-term goals are two-fold: First, we
or combine information from multiple resources. What is needed           wish to improve secondary analyses by creating a more consistent
is clear organization of pathways, whereby represented processes         and custom-tailored set of pathways for use. Second, we aim to
can be enumerated and compared between resources. In this                develop or improve on a standard nomenclature and organization
paper, we develop a set of similarity metrics based on (a) pathway       for pathways. We see this as a significant gap for the development
participants, (b) pathway names and descriptions, and (c) pathway        of biological ontologies. Although there are well-established, vetted
topological information, which can be used to infer similarity and       reference ontologies for the participants of pathway processes (such
hierarchical relationships among pathways from different databases.      as Entrez Gene for genes, UniProt for proteins and ChEBI for
These inferred relationships can be used to derive annotations to the    molecular entities (Maglott et al., 2011; Apweiler et al., 2004;
Pathway Ontology or other pathway organizational schemes.                Degtyarenko et al., 2008)), ontologies for higher-level biological
                                                                         process names are lacking, or at least not well-used.
                                                                            The Gene Ontology includes names for biological functions, but
1   INTRODUCTION
                                                                         these are mostly at the reaction level, and are not well-organized
Pathway databases provide useful structured knowledge for                into coordinated sets of reactions (Ashburner et al., 2000). A better
bioinformaticists and systems biologists, who use pathways to            starting point for an organization of pathway knowledge is the
assist in the analysis of gene expression data, build models of          Pathway Ontology (PW) (Petri et al., 2014). The PW describes
physiological processes, and explore the connections between             classes of pathways based on biological function. Pathways in the
therapeutics and disease. Pathways describe a set of biomolecular        PW are organized using is-a and part-of relationships, where “A
reactions and interactions. They can have somewhat arbitrary             part-of B” indicates that A is a subprocess of B. Pathways with
beginnings and endings, but they aim to capture the details of a         instantiation (is-a) or subprocess (part-of) relationships are located
biological process or function.                                          closer to their parent pathways in the PW hierarchy.
   Researchers can choose from a large number of pathway                    Current pathway resources (KEGG, Reactome etc.) do not use
databases and representations. The abundance of choice can               the PW, but instead may be organized via some custom-tailored
lead to confusion, since different databases can offer redundant         ontology or hierarchy. This leads to problems when comparing sets
and sometimes conflicting accounts of the same pathway. Many             of pathways from resources with disparate ontological structure and
applications of pathway resources naively combine pathway data           organization. When pathways from multiple resources are combined
sets from multiple resources (e.g. MSigDB, often used as a source        for secondary use, a shared overarching organizational scheme often
of gene sets for gene set enrichment analysis (GSEA), includes gene      does not exist. The employment of different ontologies or simply
sets derived from several pathway databases (Liberzon et al., 2015);     the lack of any initial ontological structure makes it challenging to
or ConsensusPathDB, which generates pathway-based networks               determine pathways that describe similar function.
using pathways from 32 resources (Kamburov et al., 2009)), but              There is no well-accepted standard measure of similarity among
both redundancies and conflicts can undermine the output produced        pathways. Other researchers have discussed similarity of pathway
by these tools. Results of secondary analysis using pathway              names, gene and molecular membership, and functional annotations
databases will change depending on the database chosen (Green and        as potential indicators of pathway content similarity (Grego et al.,
Karp, 2006). Khatri et al. discuss annotation inaccuracies in pathway    2010; Belinky et al., 2015). In this paper, we assess these sorts of
databases as a challenge to pathway analysis (Khatri et al., 2012).      similarity metrics. We focus on three aspects of similarity: pathway
A recent publication by Ballouz et al. also discusses bias in GSEA       names and descriptions, entity membership, and pathway topology.
due to overlaps between gene sets used for analysis, which are often        We approach this problem as one analogous to record linkage
derived from pathways (Ballouz et al., 2016). These difficulties arise   and deduplication, where data from similar or identical records can
because pathways share membership and content, which necessarily         be combined to yield better information (Christen, 2011). Instead
affects analysis performed using overlapping pathways. Instead of        of combining data records, we are identifying overlapping, subset,
using pathways as they are, we propose that individual pathways          or duplicate portions of pathway representations from different
from different databases should be pre-organized by similarity, and      pathway databases. The linkage process we apply to pathways
                                                                         consists of several steps: (1) data extraction and cleaning, (2)
∗ To whom correspondence should be addressed: lucylw@uw.edu              entity normalization, (3) indexing to yield pairs of pathways for



Wang & Gennari



comparison, (4) generation of similarity metrics, and (5) evaluation                      Resource        Number of pathways
of output. This paper focuses on describing the procedures used to                        HumanCyc        242
complete steps 1-4, and shows some initial examples of step 5.                            KEGG            122
   We first extract and clean pathway data from seven well-known                          NCI PID         745
public pathway databases. We take advantage of Biological Pathway                         Panther         177
                                                                                          Reactome        2080
Exchange (BioPAX), a standard format supported natively by                                SMPDB           724
many pathway databases (Demir et al., 2010), and meta-resources                           WikiPathways    351
like PathwayCommons that provide standardized pathway data                                Total           4441
(Cerami et al., 2011). Entity normalization involves identifying                           Table 1. Pathway counts per resource
objects from different databases that reference the same biological
entity. Indexing is an initial reduction of the number of in-             The pathway is represented as an undirected graph where nodes
depth pairwise pathway comparisons that need to be made.                  represent physical entities, or concepts such as reactions, and edges
Metrics such as graph edit distance are computationally expensive,        represent relationships. Subpathway relationships were retained to
making exhaustive pairwise pathway comparisons cost- and time-            assist in the exploration of subset relationships between pathways.
prohibitive. Within the reduced pairwise comparisons, we can
generate and evaluate similarity metrics such as entity membership        2.2   Entity normalization
overlap and graph similarity.                                             Entity normalization is the process of identifying equivalent or
                                                                          similar entities from different pathway resources. Pathways from
                                                                          all seven resources annotate their entities using external reference
2     METHODS & RESULTS                                                   identifiers, e.g., Entrez and UniProt identifiers for proteins, ChEBI
To measure similarity among pathways, we use a combination of             identifiers for molecules, etc. These cross-reference identifiers
entity membership, pathway name and description, and topological          offer a starting point for entity normalization. However, based on
similarity metrics. Entity membership overlap has been used in            previous observations, these identifiers alone do not do a sufficient
previous efforts to combine pathways into functionally similar            job of aligning like entities between databases (Wang et al., 2016).
superpaths (Vivar et al., 2013; Belinky et al., 2015). These              One main issue is the existence of synonymous identifiers (e.g.,
superpaths can be used to generate gene sets with little to no            secondary accession identifiers) and related identifiers (e.g., ChEBI
redundancy. However, we hypothesize that there are differences            conjugate acids/bases) in cross-reference databases, i.e., a single
between pathway overlap and pathway subset relationships that may         entity can reference two or more identifiers. Another issue is the
benefit from more detailed investigation. Likewise, graph alignment       existence of multiple cross-reference databases for a particular class
methods have been used to compare pathways between species to             of biological entities. We must therefore normalize entities both
discover evolutionarily conserved modules (Peregrín-Alvarez et al.,       within and among different cross-reference identifier databases.
2009; Muto et al., 2013). In our case, we are interested in using these      We generated an identifier normalization dictionary starting with
techniques to identify areas of similarity and differences between        the cross-reference identifiers given in each pathway database. If
pathway representations.                                                  a single entity references two or more identifiers, for example,
                                                                          the protein “Tryptophan 5-hydroxylase 1” in HumanCyc references
2.1    Pathway data extraction                                            both Entrez:AAA67050 and UniProt:P17752, then we infer
The dataset we used includes pathway data from each of                    synonymy between these two identifiers regardless of their given
seven resources: HumanCyc, the Kyoto Encyclopedia of Genes                synonymy in Entrez and UniProt. Further synonyms are derived
and Genomes (KEGG), the National Cancer Institute’s Pathway               from UniProt and ChEBI services. We queried for secondary
Interaction Database (NCI PID), Panther Pathways, Reactome,               accession numbers of all UniProt identifiers, and for secondary
Small Molecule Pathway Database (SMPDB), and WikiPathways                 accession numbers, conjugate acids and bases, and tautomers
(Romero et al., 2004; Kanehisa and Goto, 2000; Schaefer et al.,           of all ChEBI identifiers referenced in our dataset. Lastly, we
2009; Thomas et al., 2003; Croft et al., 2013; Frolkis et al., 2010;      supplemented this normalization dictionary using BridgeDB, a
Kutmon et al., 2016). Of these, HumanCyc (v20), Panther (v3.4.1),         service for mapping identifiers across different cross-reference
Reactome (v59), and SMPDB (version published June 5, 2016)                databases (van Iersel et al., 2010). For identifiers extracted from
were acquired in BioPAX format from the resources directly, KEGG          our pathways, we queried synonyms from BridgeDB (Ensembl
and NCI PID were downloaded in BioPAX format from Pathway                 and UniProt for proteins, ChEBI and PubChem for molecules),
Commons (PC8), and WikiPathways was exported in Graphical                 which were used to derive further equivalences between different
Pathway Markup Language (GPML) format on December 10, 2016.               identifiers.
A total of 4,441 pathways were extracted, and the pathway counts             For the following entity membership comparisons, we normalized
per resource are given in Table (1).                                      the entities in each pathway based on their cross-reference
   For each pathway within these resources, we extracted the              identifiers along with the additional synonyms we derived from
pathway name, any comments or descriptions of the pathway                 BridgeDB, UniProt, and ChEBI. Although this improved the
content, the set of entities participating in the pathway, the            number of entities matched among resources compared to using
relationships between those entities, as well as any subpathway           naive cross-reference identifier matching alone, there were still
relationships (similar to the part-of relationship described in PW).      many entities for which the appropriate normalization could not be
Pathway entities are physical entities (protein, complex, molecule,       obtained. Further normalization of both entities and relationships
DNA, RNA etc.) that participate (as a reactant, product, or               was explored using graph alignment techniques, which are
modifier) in a reaction explicitly described as part of the pathway.      discussed in section 2.5.
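
The synonym-merging idea behind the normalization dictionary in section 2.2 can be sketched with a small union-find structure. This is a minimal illustration under stated assumptions, not the implementation used in this work: the class `IdentifierNormalizer`, its method names, and the identifiers containing `EXAMPLE` are hypothetical. Only the pair Entrez:AAA67050 / UniProt:P17752 is taken from the HumanCyc example in the text.

```python
class IdentifierNormalizer:
    """Groups identifiers that are inferred to be synonymous (hypothetical helper)."""

    def __init__(self):
        self._parent = {}

    def _find(self, ident):
        # Register unseen identifiers as their own group.
        self._parent.setdefault(ident, ident)
        root = ident
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every visited identifier at the root.
        while self._parent[ident] != root:
            self._parent[ident], ident = root, self._parent[ident]
        return root

    def add_synonyms(self, identifiers):
        """Merge all identifiers that one entity (or one service) lists together."""
        ids = list(identifiers)
        for ident in ids:
            self._find(ident)
        for other in ids[1:]:
            root_a, root_b = self._find(ids[0]), self._find(other)
            if root_a != root_b:
                # Keep the lexicographically smaller root so the canonical
                # identifier is deterministic.
                keep, drop = sorted((root_a, root_b))
                self._parent[drop] = keep

    def canonical(self, ident):
        return self._find(ident)

    def normalize(self, identifiers):
        return {self.canonical(i) for i in identifiers}


norm = IdentifierNormalizer()
# Pair taken from the text: one HumanCyc entity references both identifiers.
norm.add_synonyms(["Entrez:AAA67050", "UniProt:P17752"])
# Hypothetical secondary accession, standing in for a UniProt or BridgeDB lookup.
norm.add_synonyms(["UniProt:P17752", "UniProt:EXAMPLE_SECONDARY"])

# Two toy pathways that cite the same protein through different identifiers.
pathway_a = {"Entrez:AAA67050", "ChEBI:EXAMPLE_1"}
pathway_b = {"UniProt:EXAMPLE_SECONDARY", "ChEBI:EXAMPLE_1"}

naive_shared = pathway_a & pathway_b
shared = norm.normalize(pathway_a) & norm.normalize(pathway_b)
print(len(naive_shared), len(shared))
```

In this toy case naive identifier matching finds one shared entity, while matching on canonical identifiers finds two, which mirrors the improvement over naive cross-reference matching reported above.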


2.3   Indexing using pathway names and descriptions                     pathways from all resources have matches in all other resources,
Determining groups of similar pathways is a problem akin to that of     the theoretical minimum number of clusters is 2,080. However,
record linkage and deduplication. Indexing techniques are used in       assuming that some pathways may have multiple matches in other
record linkage problems to determine likely pairs of similar records,   resources and possibly within the resource itself, we selected a k of
which can then be compared in depth (Christen, 2011). Dividing          1,040, 0.5 times the theoretical minimum, as a conservative choice.
the data into blocks (blocking) and only comparing within blocks        Similarly, for pathway descriptions, we computed a theoretical
is an effective way to reduce computational cost. Due to the time       minimum of 2,053 clusters, and used a k of 1,026.
and resource cost of computing graph edit distance, we employ              For N total pathways, blocking reduces the number of pairwise
blocking of pathway representations based on name and description       comparisons from C(N, 2) (N choose 2) to $\sum_{i=1}^{k} C(c_i, 2)$, where
similarity. This reduces the total number of pairwise comparisons       $c_i$ is the size of the i-th cluster and $\sum_{i=1}^{k} c_i = N$. With
from around 10 million (4,441 choose 2) to a much smaller number        N = 4441, the number of exhaustive comparisons is around 9.9
based on the number and sizes of blocks generated.                      million. Clustering on pathway names resulted in 1,040 clusters,
   Using pathway name similarity as a measure for pathway content       of which 584 were singleton clusters. This reduced the number
similarity has not been very successful in the past (Belinky et al.,    of pairwise comparisons to around 250,000. Clusters ranged from
2015). Very few pathways share identical names across resources,        those consisting of pathways that share an identical name, such
and those with identical or similar names usually vary significantly    as the cluster for “Glycolysis,” consisting of pathways from
on content, as measured by entity membership. However, even             HumanCyc, Panther, Reactome, and SMPDB, to those that show
though pathway name alone is not a good proxy for content, it,          a common theme, such as a cluster of 11 pathways with names such
along with a free-text description of the pathway, should yield         as “G2/M DNA damage checkpoint,” “Mitotic G2-G2/M phases,”
blocks of pathways with higher within-block similarity than random      and “response to G2/M transition DNA damage checkpoint signal.”
chance. That is, we believe that names and descriptions offer some      Some clusters contained unrelated pathways with shared words in
information about the content of a pathway representation.              their names, which may have clustered together because k was
   Of the 4,441 pathways in our data set, only 2,627 had analyzable     artificially lowered.
pathway descriptions. The remainder had either no pathway                  For pathway descriptions, exhaustive analysis yields around 3.5
description, or a pathway description containing meta-information       million pairwise comparisons. Clustering on descriptions resulted
on the writing, editing, or reviewing of the pathway. Some              in 1,026 clusters, of which 849 were singleton clusters, thereby
resources, such as NCI PID and SMPDB, had no descriptions of            reducing the number of pairwise comparisons to around 150,000.
any of the pathways encoded in their BioPAX exports. We therefore       Inspection reveals clusters where pathways explore similar themes,
analyzed the pathway names and descriptions separately, generating      such as a large cluster that includes pathways dealing with DNA
two sets of pathway blocks. We describe the work done for pathway       synthesis and repair, and another that deals with pathways of fatty
names; the process was repeated for pathway descriptions.               acid metabolism.
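
The pair-count arithmetic behind blocking can be checked with a few lines of code. The sketch below is illustrative only: the function names and the toy cluster sizes are assumptions, and the actual cluster sizes from this work are not reproduced here. The pathway counts 4,441 and 2,627 come from the text.

```python
from math import comb


def exhaustive_pairs(n_pathways):
    """Number of pairwise comparisons without blocking: C(N, 2)."""
    return comb(n_pathways, 2)


def blocked_pairs(cluster_sizes):
    """Number of comparisons when only pairs inside the same block are compared."""
    return sum(comb(size, 2) for size in cluster_sizes)


# Counts reported in the text: all pathways, and pathways with usable descriptions.
all_names = exhaustive_pairs(4441)
all_descriptions = exhaustive_pairs(2627)

# Hypothetical block sizes for seven pathways; singleton blocks contribute no pairs.
toy_sizes = [3, 2, 1, 1]
toy_exhaustive = exhaustive_pairs(sum(toy_sizes))
toy_blocked = blocked_pairs(toy_sizes)

print(all_names, all_descriptions, toy_exhaustive, toy_blocked)
```

The two exhaustive counts agree with the rounded figures quoted in this section (around 9.9 million and around 3.5 million). Because singleton clusters add nothing to the sum, a clustering with many singletons shrinks the comparison count sharply.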
   We treat each pathway name as a document, and cluster them           2.4   Using entity membership overlap to determine the
into topics. This is accomplished by calculating the term frequency-          validity of clustering results
inverse document frequency (tf-idf) statistic for each word in each
pathway name. The tf-idf is a measure of the significance of a word     We can test the validity of clustering on pathway names and
to a group of documents, and is often used for term-weighting in        descriptions using the independent measure of pathway entity
topic modeling. The statistic is given by equation (1), where the       membership overlap. Entity membership overlap can be represented
statistic for wi,d (word i in document d) is the product of the term    using the Jaccard index, a measure of set similarity, defined for two
frequency tfi,d (how often the word i appears in document d),           sets (S1 and S2 ) as the ratio of the size of their intersection over the
and the log inverse of the document frequency dfi (the number of        size of their union (equation (2)).
documents in the corpus that contain word i) divided by N , the total
number of documents in the corpus.                                            $J(S_1, S_2) = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|}$        (2)

         $w_{i,d} = tf_{i,d} \times \log\left(\frac{N}{df_i}\right)$        (1)      For each pathway, its entity membership is represented as the set
                                                                        $P_i = \{e_1, e_2, \ldots, e_n\}$. The Jaccard index is computed between a
                                                                        pair of pathways i and j as $J(P_i, P_j)$. From k-means clustering
   After computing tf-idf scores for all words in all pathway names,    results on pathway names, we performed pairwise comparisons
we performed k-means clustering on the tf-idf score vectors to          of all pathways within each cluster. The average pairwise within-
generate blocks of similar pathway names. K-means clustering            cluster Jaccard index (APWJ) was 0.021. We generate an expected
is a centroid-based unsupervised clustering method that yields k        APWJ distribution using the following bootstrapping method. We
clusters from the input data where each data point belongs to           randomly sample the data into clusters corresponding to the cluster
the cluster with the closest mean. Due to the imbalance in the          sizes of our k-means output. Within these generated clusters, we
number of pathways from each resource, we expect many singleton         compute the APWJ. We randomly sample 10,000 times to generate
clusters, those with only one member. We initially employed the         an expected distribution for APWJ. This expected distribution
elbow method to select the number of resulting clusters k, but          is Gaussian, with mean 0.013 and sigma 7.0e-4
because no clear elbow was seen in within-cluster variance as we        (Figure (1A)). The APWJ of our pathway name k-means clustering
increased the cluster count, we were unable to determine an ideal       results falls more than 11 standard deviations away from the mean of
k experimentally. Instead, we calculated a theoretical minimum k        this expected distribution. This indicates that our pathway clusters
based on the number of pathways in each resource. Assuming all          show significantly higher entity overlap than clusters generated





through random sampling of the data. In other words, pathway name          perfect topological match between the two graphs. A higher GED
is effective at blocking the data into content-similar groups.             score indicates improved topological matching, which by itself does
                                                                           not guarantee accurate entity matching. Two graphs with the same
                                                                           topology and completely different entity memberships will have a
                                                                           high GED score, so the GED is only useful in the context of high
                                                                           entity Jaccard index. GEDEVO also does not penalize having graphs
                                                                           of different sizes, and extra nodes and edges can remain unmatched.
                                                                              Because there is no gold standard entity alignment among
                                                                           pathway representations, evaluations of the goodness of the graph
                                                                           alignments produced could only be done manually. The global
                                                                           alignment showed promise in cases where a good portion of
                                                                           all entities were prematched. Otherwise, the alignment did not
                                                                           offer usable entity alignments. For example, figure (2) shows
                                                                           an alignment between the Reactome pathway “Phenylalanine
                                                                           and tyrosine catabolism” and the HumanCyc pathway “tyrosine
                                                                           degradation I,” two pathways that clustered together based on
                                                                           pathway name tf-idf scores. The entity memberships of the
                                                                           pathways show overlap (Jaccard index = 0.43). In this case, the
                                                                           entities in the HumanCyc pathway are actually a subset of the
                                                                           entities in the Reactome pathway, so we expect a potential part-
                                                                           of relationship between the two pathways. In figure (2) we observe
                                                                           this subset relationship. We recognize that visualization of complex
                                                                           pathway information is an open and largely unsolved problem
                                                                           outside of our scope; Figures (2) and (3) are hand-drawn.




      Fig. 1. The average within-cluster Jaccard index for 10,000 random
        clusterings of pathway names (A) and pathway descriptions (B).
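
The random-cluster comparison summarized in Fig. 1 can be sketched as follows. This is a minimal illustration under assumptions rather than the code used for the reported results: the function names and toy data are hypothetical, the APWJ is taken here as the mean over all within-cluster pairs pooled across clusters, and the Jaccard index of two empty sets is defined as 0.

```python
import random
from itertools import combinations
from statistics import mean, pstdev


def jaccard(set_a, set_b):
    """Jaccard index |A & B| / |A | B|, with 0.0 for two empty sets (assumption)."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def apwj(clusters, members):
    """Average pairwise within-cluster Jaccard index, pooled over all clusters."""
    scores = [
        jaccard(members[p], members[q])
        for cluster in clusters
        for p, q in combinations(cluster, 2)
    ]
    return mean(scores) if scores else 0.0


def random_clusters(pathway_ids, sizes, rng):
    """Shuffle the pathways and cut them into clusters of the given sizes."""
    shuffled = list(pathway_ids)
    rng.shuffle(shuffled)
    clusters, start = [], 0
    for size in sizes:
        clusters.append(shuffled[start:start + size])
        start += size
    return clusters


def expected_apwj(clusters, members, n_samples, seed=0):
    """APWJ values for random clusterings that keep the observed cluster sizes."""
    rng = random.Random(seed)
    ids = [p for cluster in clusters for p in cluster]
    sizes = [len(cluster) for cluster in clusters]
    return [apwj(random_clusters(ids, sizes, rng), members) for _ in range(n_samples)]


# Toy entity sets and clusters; real inputs would be normalized entity identifiers.
toy_members = {"a": {1, 2, 3}, "b": {2, 3, 4}, "c": {7, 8}, "d": {8, 9}, "e": {20}}
toy_clusters = [["a", "b"], ["c", "d"], ["e"]]

observed = apwj(toy_clusters, toy_members)
null = expected_apwj(toy_clusters, toy_members, n_samples=200)
null_mean, null_sd = mean(null), pstdev(null)
# Distance of the observed APWJ from the random expectation, in standard deviations.
z = (observed - null_mean) / null_sd if null_sd > 0 else float("inf")
print(observed, null_mean, null_sd, z)
```

With the full data set, `n_samples` would be 10,000 and `observed` would be the APWJ of the k-means clusters.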

   The same procedure was followed for pathway descriptions. The
bootstrapped APWJ distribution for our data had mean 0.016 and
sigma 9.0e-4 (Figure (1B)). The APWJ for k-means clustering
results was 0.027, more than 12 standard deviations away from the
expected value for random clusters, indicating significant content
overlap in our clusters compared to random.                                Fig. 2. Graph alignment results of two metabolic pathways with a subset
                                                                            relationship. Entities found in both pathways are outlined in black. Gray
2.5      Employing graph edit distance                                          lines and circles are those relationships and entities found only in
Graph edit distance (GED) is a measure of similarity between two           Reactome; dotted gray lines show relationships only found in HumanCyc.
                                                                                All reactions are labeled ’Rx’; all complexes are labeled ’C’; and
graphs. The measure is based on the number of node and edge
                                                                           abbreviations for proteins and molecules are taken from Reactome. Green
insertions, substitutions, or deletions necessary to transform one
                                                                              entities are prematched between the two pathways on cross-reference
graph into another. The measure is calculated by performing a                identifiers. Yellow entities are correctly aligned by GEDEVO, and red
global graph alignment between two graphs, and then calculating              entities are incorrectly aligned by GEDEVO and manually aligned by
the number of transformations necessary.                                                        inspecting entity names and types.
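
To make the edit-counting idea concrete, the sketch below counts node and edge insertions and deletions between two small undirected graphs whose nodes are already matched by label. It is not GEDEVO and does not search over alignments. With every node aligned by its label, the count is an upper bound on the unit-cost graph edit distance. The function names, the normalization to [0, 1], and the toy graphs are all assumptions made for illustration.

```python
def undirected_edges(pairs):
    """Store each undirected edge as a frozenset so that (u, v) equals (v, u)."""
    return {frozenset(pair) for pair in pairs}


def edit_count(nodes_a, edges_a, nodes_b, edges_b):
    """Insertions plus deletions needed to turn graph A into graph B
    when nodes with the same label are treated as prematched."""
    node_edits = len(nodes_a ^ nodes_b)   # nodes present in only one graph
    edge_edits = len(edges_a ^ edges_b)   # edges present in only one graph
    return node_edits + edge_edits


def similarity(nodes_a, edges_a, nodes_b, edges_b):
    """Assumed normalization: 1.0 for identical graphs, 0.0 for disjoint ones."""
    total = len(nodes_a | nodes_b) + len(edges_a | edges_b)
    if total == 0:
        return 1.0
    return 1.0 - edit_count(nodes_a, edges_a, nodes_b, edges_b) / total


# Toy graphs with hypothetical labels: B is a connected subgraph of A.
nodes_large = {"phenylalanine", "rx_1", "tyrosine", "rx_2", "product_x"}
edges_large = undirected_edges([
    ("phenylalanine", "rx_1"), ("rx_1", "tyrosine"),
    ("tyrosine", "rx_2"), ("rx_2", "product_x"),
])
nodes_small = {"tyrosine", "rx_2", "product_x"}
edges_small = undirected_edges([("tyrosine", "rx_2"), ("rx_2", "product_x")])

edits = edit_count(nodes_large, edges_large, nodes_small, edges_small)
score = similarity(nodes_large, edges_large, nodes_small, edges_small)
print(edits, score)
```

Unlike this sketch, a tool that leaves extra nodes and edges unmatched without penalty would score the subset case more favorably, so the two scores are not directly comparable.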
   In our case, we prematch entities between two pathway graph
representations, which reduces the computational complexity of               Terminology in the Pathway Ontology can be used to describe
performing a global graph alignment. We used the GEDEVO                    the relationship between this pair of pathways. Both pathways
software tool from the Computational Systems Biology Group of the          are examples of PW:0001074, named “hydrophobic amino acid
Max Planck Institute for Informatics in Saarbruecken (Ibragimov            metabolic pathway.” The HumanCyc pathway is an instance of the
et al., 2013). This tool takes two graph representations as input,         PW leaf node PW:0001284, or “tyrosine degradation pathway.” The
and calculates their GED along with a global graph alignment.              Reactome pathway consists of elements of both the PW leaf node
The GED is normalized to between 0 and 1, with 1 indicating a              “phenylalanine degradation pathway” (PW:0001288) and the PW


leaf node “tyrosine degradation pathway” (PW:0001284). Using                     From our clustering results, we can observe several different
the organization of the PW, we find that the Reactome pathway                 relationships between pathways. Some pathways describe similar
could potentially be broken down into two constituent parts, the              processes, and show good entity overlap, such as the example
conversion of phenylalanine into tyrosine, and the subsequent                 given in figure (3). These overlapping pathways (A and B) are
degradation of tyrosine.                                                      both instances of the same pathway class C (if the PW were
   A simple example of the benefit of a unifying ontology such as             adopted, the class would be “oxytocin signaling pathway”). Other
PW is that it would eliminate mismatches due to simple synonyms               pathways show a subset relationship as in figure (2), where one
such as “catabolism” (used by Reactome) and “degradation” (used               pathway can be described as a subprocess of the other pathway,
by HumanCyc). More interestingly, an ontology may allow for a                 exemplifying the part-of relationship. A third case is possible, but
more careful distinction of the relationships between pathways, for           not illustrated, where one pathway is both a subset of another
example, by drawing attention to the tyrosine degradation pathway             pathway and describes the same overall process. This could happen
being a subprocess of the phenylalanine degradation pathway.                  if modelers use different levels of granularity. The subset entities
Although biologically, it may make sense to combine these into                would then be interleaved through the larger pathway as opposed
one pathway, as Reactome does, the duplication of the tyrosine                to forming a tightly connected subnetwork as in the subprocess
degradation subprocess may be problematic for secondary use.                  case. This would still be an example of sibling relationships to
                                                                              some parent class, with the siblings differing in granularity. All
                                                                              three cases: overlap, subprocess, and granularity subset, can be
                                                                              discovered using a combination of entity membership and graph
                                                                              metrics.
                                                                                 Identifying these relationships is an important step to reducing
                                                                              redundancy in pathway data for secondary use. Overlapping
                                                                              pathways could be reduced to a single pathway representation.
                                                                              Pathways containing subprocesses could be modularized into
                                                                              several non-overlapping parts, or subpathways. For example, the
                                                                              Reactome pathway “Phenylalanine and tyrosine catabolism” from
                                                                              figure (2) could be broken down into two subprocesses, the
                                                                              conversion of phenylalanine into tyrosine (the gray entities from the
                                                                              figure), and the degradation of tyrosine (the colored entities from the
  Fig. 3. Graph alignment of two signaling pathways describing the same       figure). PW terms could be used to help identify these relationships
process, where a majority of reactions are shown in both pathways. Entities   between pathways. The PW is-a relationship describes both overlap
  found in both pathways are outlined in black. Gray lines and circles are    and granularity subset relationships, and the PW part-of relationship
  those relationships and entities found only in Panther; dotted gray lines   describes subprocess relationships.
 show relationships only found in WikiPathways. All reactions are labeled        Our initial results are two-fold. First, we demonstrate that useful
   ‘Rx’; all complexes are labeled ’C’; all abbreviations for proteins and    similarity information can be gathered from pathway names and
                     molecules are taken from Panther.                        descriptions. Second, we propose that further similarity information
   Figure (3) shows another example, this time of two overlapping             can be derived by combining a set of measures: pathway names,
pathways, Panther Pathway’s “Oxytocin receptor mediated                       entity membership, and graph edit distance. We demonstrate
signaling pathway” and WikiPathways’ “Oxytocin signaling.” The                this second point with some initial proof-of-concept examples
entity memberships of these two pathways show good overlap                    (Figures (2) and (3)). We also advocate the use of an organizing
(Jaccard index = 0.40). Both of these pathways describe the same              ontology such as the Pathway Ontology to help identify pathway
process, which is denoted by the PW leaf node PW:0000494,                     overlap and subprocess relationships.
or “oxytocin signaling pathway.” There were a few prematched
entities, and the graph alignment produced by GEDEVO was
able to infer several additional entity matches between the two
                                                                              3.1   Limitations & Future Work
pathways. However, the performance seems less good compared to                There are several notable points of potential improvement in the
the previous example due to greater differences in representation             procedures described in this paper. Pathway names and descriptions
between the two pathways.                                                     were used to cluster pathways using tf-idf scores. Stemming and
                                                                              lemmatization could be employed to derive better clustering results.
                                                                              Stemming and lemmatization is the process of reducing words to
3   DISCUSSION                                                                their base form; for example, metabolism, metabolic, and metabolite
Improving the way we discuss and measure similarity among                     all share the same word stem. Prefix and suffix analysis can also
pathway representations will have many repercussions for                      help discover similar classes of words, especially chemical species,
secondary use of pathway resources. Instead of using all pathways             whose types can be derived from suffixes, like -oses (sugars) and
available for pathway analysis, we could eliminate redundant                  -ases (proteins). Especially for pathway names, for which few
pathways and increase the power of our analysis results. We could             words are present, stemming and suffixing could greatly improve
also better organize pathways, thereby making clear where overlap             our measure of name similarity. Additionally, tf-idf scores do not
and subprocess relationships occur. Thus, our work builds from the            represent syntactic or semantic information, causing similar phrases
Pathway Ontology, and we aim to infer similarity and hierarchical             with different key words to cluster together incorrectly. Using a
relationships among pathways across resources.                                greater variety of lexical features could help offset this weakness.
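   As a rough illustration of the stemming point above, the following Python
sketch compares tf-idf cosine similarity between pathway names with and without
a crude suffix-stripping step. It is not the procedure used in this paper: the
suffix list, the function names, and the three pathway names are hypothetical
and chosen only for illustration, and an established stemmer or lemmatizer would
be the more realistic choice in practice.

```python
import math
import re
from collections import Counter

# Hypothetical suffix list, chosen only for illustration (longest first).
SUFFIXES = ("ation", "ism", "ite", "ing", "ic", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 4 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def tokenize(name, stem=True):
    """Lowercase a pathway name, split it into words, and optionally stem."""
    words = re.findall(r"[a-z]+", name.lower())
    return [crude_stem(w) if stem else w for w in words]


def tfidf_vectors(names, stem=True):
    """Return one sparse tf-idf vector (term -> weight) per pathway name."""
    docs = [Counter(tokenize(name, stem)) for name in names]
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in doc)
    # Smoothed idf, so a term found in every name keeps a non-zero weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def name_similarity(names, i, j, stem=True):
    """tf-idf cosine similarity between names[i] and names[j]."""
    vectors = tfidf_vectors(names, stem=stem)
    return cosine(vectors[i], vectors[j])


if __name__ == "__main__":
    # Invented pathway names, not taken from any of the resources studied.
    names = ["Tyrosine metabolism", "Tyrosine metabolic pathway", "Oxytocin signaling"]
    print("without stemming:", name_similarity(names, 0, 1, stem=False))
    print("with stemming:   ", name_similarity(names, 0, 1, stem=True))
```

   In this toy setting, stemming maps “metabolism” and “metabolic” to the same
term, so the two tyrosine names share two terms rather than one and their
similarity rises, while the unrelated oxytocin name still shares no terms with
either.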


   Another major challenge was entity normalization. In many cases, we
discovered synonymous entities in two resources that did not share
cross-reference identifiers. This could be helped by extending our entity
normalization dictionary. More synonyms can be derived from third-party
reference databases, although our usage of BridgeDB identifier mapping services
already does this to some degree. We can also infer entity equivalence or
synonymy using other information, such as the entity’s name, or the reactions in
which it participates. The calculation of a global graph alignment is one way to
derive potential synonyms. The graph alignment algorithm employed by GEDEVO does
not take into account features such as node name or type, which may help in
identifying more synonyms. The inference of identifier synonymy through
alternative means could also potentially be used to identify missing
cross-reference identifiers in reference ontologies.
   Lastly, we hope to provide a platform for exploring the overlaps among these
pathways and to allow for the generation of pathway data sets with reduced
redundancies among member pathways. Such an interface would allow the user to
search for pathways from multiple sources, adjust the degree to which similar
pathways should be merged into superpathways or broken down into non-overlapping
subpathways, and export the resulting pathways for secondary use. For example,
the user could generate unique gene sets for GSEA or other types of
pathway-based enrichment analysis, or create novel explanatory pathways using
the non-overlapping segments of existing pathways. A user interface could also
leverage the work of the Pathway Ontology for organizing or annotating pathways
from different databases. An evaluation of the usefulness and correctness of
identified overlaps and similarities between pathways can be conducted formally
through a qualitative assessment of biologists and their interactions with
various merged pathway representations through this proposed platform.

3.2     Conclusion
Understanding similarities and redundancies among pathway representations is
critical for improving the quality of secondary analyses performed using pathway
resources. Associations among different pathways can be deduced by studying the
features of each individual pathway, such as its name, description, entity
membership, and topological structure. A hierarchical organizational structure
such as the Pathway Ontology is a useful way to organize pathways. Here, we have
shown that an analysis of a combination of features (names, entities and graph
topology) could be used to infer similarity and relationship information between
pathways. Our goal is to provide an umbrella organizational structure across
multiple pathway databases that will make it easier for researchers to use
pathways with appropriate content and granularity.

ACKNOWLEDGEMENTS
This study was supported in part by the National Library of Medicine (NLM)
Training Grant T15LM007442.

REFERENCES
Apweiler, R., Bairoch, A., and et al, C. H. W. (2004). Uniprot: the universal protein
  knowledgebase. Nucleic Acids Res, 32(Database issue), D115–119.
Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis,
  A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P.,
  Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E.,
  Ringwald, M., Rubin, G. M., and Sherlock, G. (2000). Gene ontology: tool for the
  unification of biology. the gene ontology consortium. Nature Genetics, 25, 25–29.
Ballouz, S., Pavlidis, P., and Gillis, J. (2016). Using predictive specificity to determine
  when gene set analysis is biologically meaningful. Nucleic Acids Research, 45.
Belinky, F., Nativ, N., Stelzer, G., Zimmerman, S., Stein, T. I., Safran, M., and Lancet,
  D. (2015). Pathcards: multi-source consolidation of human biological pathways.
  Database, 2015.
Cerami, E. G., Gross, B. E., and et al, E. D. (2011). Pathway commons, a web resource
  for biological pathway data. Nucleic Acids Res, 39(Database issue), D685–690.
Christen, P. (2011). A survey of indexing techniques for scalable record linkage and
  deduplication. IEEE Transactions on Knowledge and Data Engineering, 24,
  1537–1555.
Croft, D., Mundo, A. F., and et al, R. H. (2013). The reactome pathway knowledgebase.
  Nucleic Acids Res, 42(Database issue), D472–477.
Degtyarenko, K., de Matos, P., and et al, M. E. (2008). Chebi: a database and ontology
  for chemical entities of biological interest. Nucleic Acids Res, 36(Database issue),
  D344–350.
Demir, E., Cary, M., and et al, S. P. (2010). The biopax community standard for pathway
  data sharing. Nature Biotechnology, 28(9), 935–42.
Frolkis, A., Knox, C., Lim, E., Jewison, T., Law, V., Hau, D. D., Liu, P., Gautam, B., Ly,
  S., Guo, A. C., Xia, J., Liang, Y., Shrivastava, S., and Wishart, D. S. (2010). Smpdb:
  The small molecule pathway database. Nucleic Acids Research, 38, D480–D487.
Green, M. L. and Karp, P. D. (2006). The outcomes of pathway database computations
  depend on pathway ontology. Nucleic Acids Research, 34, 3687–3697.
Grego, T., Ferreira, J. D., Pesquita, C., Bastos, H., Vila Vicosa, D., Freire, J., and
  Couto, F. M. (2010). Chemical and metabolic pathway semantic similarity. FC-DI
  - Technical Reports.
Ibragimov, R., Malek, M., Guo, J., and Baumbach, J. (2013). Gedevo: An evolutionary
  graph edit distance algorithm for biological network alignment. GCB, 2013, 68–79.
Kamburov, A., Wierling, C., Lehrach, H., and Herwig, R. (2009). Consensuspathdb – a
  database for integrating human functional interaction networks. Nucleic Acids Res,
  37(Database issue), D623–628.
Kanehisa, M. and Goto, S. (2000). Kegg: Kyoto encyclopedia of genes and genomes.
  Nucleic Acids Res, 28, 27–30.
Khatri, P., Sirota, M., and Butte, A. J. (2012). Ten Years of Pathway Analysis: Current
  Approaches and Outstanding Challenges. PLOS Comput Biol, 8(2), e1002375.
Kutmon, M., Riutta, A., and et al, N. N. (2016). Wikipathways: capturing the full
  diversity of pathway knowledge. Nucleic Acids Res, 44(D1), D488–D494.
Liberzon, A., Birger, C., Thorvaldsdóttir, H., Ghandi, M., Mesirov, J. P., and Tamayo,
  P. (2015). The molecular signatures database (msigdb) hallmark gene set collection.
  Cell Systems, 1, 417–425.
Maglott, D., Ostell, J., Pruitt, K. D., and Tatusova, T. (2011). Entrez gene:
  gene-centered information at ncbi. Nucleic Acids Research, 39, D52–57.
Muto, A., Kotera, M., Tokimatsu, T., Nakagawa, Z., Goto, S., and Kanehisa, M. (2013).
  Modular Architecture of Metabolic Pathways Revealed by Conserved Sequences of
  Reactions. J. Chem. Inf. Model., 53(3), 613–622.
Peregrín-Alvarez, J. M., Sanford, C., and Parkinson, J. (2009). The conservation and
  evolutionary modularity of metabolism. Genome Biology, 10, R63.
Petri, V., Jayaraman, P., Tutaj, M., Hayman, G. T., Smith, J. R., Pons, J. D.,
  Laulederkind, S. J. F., Lowry, T. F., Nigam, R., Wang, S.-J., Shimoyama, M.,
  Dwinell, M. R., Munzenmaier, D. H., Worthey, E. W., and Jacob, H. J. (2014).
  The pathway ontology updates and applications. Journal of Biomedical Semantics,
  5.
Romero, P., Wagg, J., Green, M. L., Kaiser, D., Krummenacker, M., and Karp, P. D.
  (2004). Computational prediction of human metabolic pathways from the complete
  human genome. Genome Biology, 6(R2), 1–17.
Schaefer, C. F., Anthony, K., Krupa, S., Buchoff, J., Day, M., Hannay, T., and Buetow,
  K. H. (2009). Pid: the pathway interaction database. Nucleic Acids Research, 37,
  D674–D679.
Thomas, P. D., Campbell, M. J., and et al, A. K. (2003). Panther: a library of protein
  families and subfamilies indexed by function. Genome Res, 13, 2129–2141.
van Iersel, M. P., Pico, A. R., Kelder, T., Gao, J., Ho, I., Hanspers, K., Conklin, B. R.,
  and Evelo, C. T. (2010). The bridgedb framework: standardized access to gene,
  protein and metabolite identifier mapping services. BMC Bioinformatics, 11.
Vivar, J. C., Pemu, P., McPherson, R., and Ghosh, S. (2013). Redundancy control
  in pathway databases (recipa): An application for improving gene-set enrichment
  analysis in omics studies and “big data” biology. OMICS, 17, 414–422.
Wang, L. L., Gennari, J. H., and Abernethy, N. F. (2016). An analysis of differences in
  biological pathway resources. Proceedings of the Joint International Conference on
  Biological Ontology and BioCreative, 2016.