An analysis of different ontological approaches to describe renal mutant phenotypes

Kirsty Lee*1 and Duncan Davidson2

1 School of Informatics, University of Edinburgh, Edinburgh, EH8 9LW
2 MRC Human Genetics Unit, Western General Hospital, Crewe Road, Edinburgh, UK, EH4 2XU

*corresponding author

Email: Kirsty Lee - k.a.lee@sms.ed.ac.uk; Duncan Davidson - Duncan.Davidson@hgu.mrc.ac.uk

Abstract

Ontologies have increasingly been used in the representation of a variety of biological data. The major alternative ontologies available for mouse phenotype description are the MPO (Mammalian Phenotype Ontology) and PaTO (Phenotype and Trait Ontology). Ontologies have the potential of contributing to the analysis of mutant phenotypes by providing a framework for reasoning. However, any reasoning task will be of limited value if a phenotype ontology cannot represent the majority of phenotypes in publications accurately and in sufficient detail. Therefore, it is important to investigate the accessibility and expressivity of phenotype ontologies, firstly to ensure the scope and consistency of phenotype databases but also as a prerequisite for meaningful automatic reasoning methods. Accessibility in this context is used to refer to the ‘ease of use’ or how easy it is for researchers to encode phenotype descriptions using the ontology. There have not yet been any published case studies specifically comparing the suitability of current phenotype ontologies for accurately capturing and representing phenotypes using real data sets. This paper incorporates the findings of a 6-month case study which explored potential methods of phenotype description for the EuReGene (European Renal Genome) project. The project uses mouse, rat, zebrafish and Xenopus models to examine gene expression patterns and phenotypes relevant to human kidney disease.
During the course of the case study, it was possible to visit the participating laboratories which gave a unique and pragmatic insight into how phenotype ontologies can match the requirements of the mouse research community.

1 Introduction

The abundance of phenotypic data emerging from mouse mutagenesis screens [1][2] implies a need to describe phenotypes in a way that is amenable to computational comparison. Phenotype comparison is imperative in order to study the underlying genetic mechanisms, and may involve identifying subtle differences between mutant phenotypes. When phenotypic descriptions come in the form of free text, placing lexical and syntactic constraints on them may allow for a more effective comparison. Recently, ontologies have provided these constraints and have increasingly been used in the representation of a variety of biological data [3]. The major alternative ontologies available for mouse phenotype description are the MPO (Mammalian Phenotype Ontology) [4] and PaTO (Phenotype and Trait Ontology) [5].

Ontologies should be able to contribute to the analysis of mutant phenotypes by providing a framework for reasoning. However, any reasoning task will be of limited value if a phenotype ontology cannot represent the majority of phenotypes in publications accurately and in sufficient detail. Therefore, it is important to investigate the accessibility and expressivity of phenotype ontologies, firstly to ensure the scope and consistency of phenotype databases but also as a prerequisite for meaningful automatic reasoning methods. Accessibility in this context is used to refer to the ‘ease of use’ or how easy it is for researchers to encode phenotype descriptions using the ontology.

‘Ease of use’ and expressivity have been highlighted as requirements for the OWL web ontology language, set out in 2004 [6]. Regarding expressivity, the OWL guidelines state that “the language should be as expressive as possible, so that users can state the kinds of knowledge important to their applications”. Regarding the accessibility of the ontology, “the language should provide a low learning barrier and have clear concepts and meaning”. Of course, the accessibility of an ontology is dependent on the prior knowledge of the annotator. However, if ontologies are to be used on a large scale, then researchers who may not have been directly involved in ontology development could annotate the phenotype data. Therefore, it is desirable to make the annotation process as easy as possible for a non-ontology expert.

The tenets set out in the OWL guidelines are equally applicable to phenotype ontologies. However, there have not yet been any published case studies specifically comparing the suitability of current phenotype ontologies for accurately capturing and representing phenotypes using real data sets. For the Gene Ontology, Dolan et al. (2005) have developed a procedure to address annotation inconsistency using orthologous mouse and human genes [7], although there are not yet any similar studies for phenotype ontologies.

This paper will incorporate the findings of a 6-month case study which explored potential methods of phenotype description for the EuReGene project (described further in Section 2) [8]. During the course of the case study, it was possible to visit the participating laboratories which gave a unique and pragmatic insight into how phenotype ontologies can match the requirements of the mouse research community.

2 EuReGene Project Overview

The EuReGene (European Renal Genome) project was established in 2005 to bring together expertise from across Europe in order to study kidney function in the normal and diseased states [8]. The project uses mouse, rat, zebrafish and Xenopus models to examine gene expression patterns and phenotypes relevant to human kidney disease. Part of the project remit is to allow access to the research outcomes via databases available on the project website. Gene expression and phenotype data are major components of these research outcomes. One important task is to link gene expression and phenotype data, which should be achievable by annotating both with a common anatomy ontology. However, as this research is concerned with phenotype description, discussion of gene expression data will be peripheral to the main discussion. Phenotypes are recorded at the various EuReGene research centres and the descriptions are contained in spreadsheets or described using free text. There is an obvious role for phenotype ontologies in standardising the phenotype descriptions. Thus, the EuReGene project provides an opportunity to examine both the accessibility and expressivity of current phenotype ontologies.

3 EuReGene Phenotype Data

The EuReGene phenotype data set currently consists of 20 mouse models with 121 phenotype descriptions. The majority of EuReGene phenotype descriptions were for adult mice and relate to kidney physiology; some developmental phenotypes were also described. Phenotype data sheets were submitted by 16 EuReGene partners. The completion of each data sheet required researchers to describe phenotypic characteristics of their model using free text, i.e. without selecting ontology terms. It was important that researchers described the phenotype using free text so that, in case of any information loss during the annotation process, the descriptions remained full and accurate. The assays/experimental methods used for phenotype detection were also described on each sheet. Genetic information relating to each animal model was also recorded, such as the targeted gene and the type of genetic manipulation used. Table 1 shows the headings used on each sheet and examples of entries under each heading.

4 Encoding EuReGene phenotype data

4.1 Ontologies used for encoding phenotypes

Since the main ontologies available for mouse phenotype descriptions are the MPO (Mammalian Phenotype Ontology) and PaTO (Phenotype and Trait Ontology), these have been used to annotate the EuReGene data.

The MPO has been developed by the Mouse Genome Informatics (MGI) group at the Jackson Laboratory to describe both in-house and external mouse phenotype data from biomedical research literature. The MPO contains 5769 terms (recorded November 2007) for describing phenotypes, which are organised into a directed acyclic graph (DAG). The MPO uses atomic terms, each designed to encapsulate a complete phenotype description such as ‘polyuria’. In contrast, PaTO is a set of terms to describe phenotypic qualities (size, shape or colour, for example) and is designed to be used in combination with several other ontologies. Using an established ontology such as the MPO holds an advantage over developing an in-house ontology, as the data can be linked with external bioinformatics resources such as phenotype data held at the MGI. For a detailed review of the MPO see [4].

Gene expression data produced by EuReGene members has previously been annotated using EMAP (an anatomical ontology used by the EMAGE database to describe the developing mouse embryo [9][10]). Thus we can associate phenotypes with related gene expression patterns using EMAP.

4.2 The Encoding Process

Each EuReGene phenotype description was annotated (where possible) using the EMAP, MPO and PaTO ontologies. Since PaTO is intended to be used in a compositional framework with other ontologies, there were also three other ontologies used: the Gene Ontology [11], Cell Type Ontology [12] and the ChEBI (Chemicals of Biological Interest) ontology [13]. However, these were used infrequently (see Table 2) and the majority of PaTO qualities were used in conjunction with EMAP anatomical terms. To avoid any annotator bias, each annotation was verified by the researchers involved in creating the mouse models, who had intimate knowledge of the phenotype. Some phenotype descriptions were mapped to two ontology terms but most had a one-to-one mapping. 121 tuples were created for describing EuReGene phenotypes relating to a specific genotype. Table 2 shows the number of terms from each ontology used to describe the EuReGene phenotypes.

5 Suitability of ontologies to encode EuReGene descriptions

5.1 Suitability of EMAP

Figure 1 shows that the most commonly annotated EMAP term was ‘metanephros’. Annotated EMAP terms were at the organism level (mouse), system level (cardiovascular system), tissue level (renal interstitium group) and organ level (metanephros). Eight phenotype descriptions could not be annotated with an EMAP term (shown in Table 3). The first three of these were examples where a type of cell or anatomical part was described rather than a particular named instance. An example is “bone calcification defects”. There were four examples of phenotypes at the protein level, for example “normal renin expression”. As EMAP is an anatomical ontology, it is unsuitable for annotating these phenotypes. The final example is ‘hilar artery’, which could not be found in EMAP. However, there was no apparent reason for the absence of hilar artery.

5.2 Suitability of MPO

Figure 2 shows the adequacy of the MPO for describing EuReGene phenotypes. In 45 out of 121 phenotype descriptions an MPO term was found which was either synonymous or an exact replication of the free text phenotype description. There were 44 examples where the MPO could be used but the annotated term was at a higher granularity (less specific) than the free text description. For 33 examples, there was no appropriate MPO term available.

- MPO Strengths

Before considering the weaknesses of the MPO, this section considers where the strengths of the MPO lie. Table 4 shows examples where the MPO (but not PaTO) was able to describe the phenotype. The MPO is able to annotate all clinical descriptions in the EuReGene data set, for example ‘holoprosencephaly’. Three of the five examples in Table 4 (which are only annotated by the MPO) are clinical descriptions which have corresponding MPO terms but are difficult to describe using PaTO within a compositional framework.

In some cases, PaTO provides an approximate description for clinical descriptions. However, clinical terms in the EuReGene data set are expressed more accurately using the MPO, as exemplified by “hydrops fetalis” (defined by the MPO as “an abnormal accumulation of serous fluid in fetal tissues”). An approximate description of “hydrops fetalis” using EMAP and PaTO is ‘mouse’ + ‘edematous’. However, the PaTO description is less accurate than the MPO since fetal tissue is not incorporated in the description. There are further examples of clinical terms which can be annotated with both ontologies but are much more intuitively annotated using the MPO, such as “hypercalciuria”. These examples are discussed in relation to the suitability of PaTO in Section 5.3.

- MPO Weaknesses

In order to identify the weaknesses of the MPO, it is necessary to study examples where the MPO was not able to describe the phenotype. However, before these examples are discussed, it is also important to consider examples where the MPO annotation resulted in a less specific phenotype description. These examples are shown in Table 5 and correspond to the third category shown previously in Figure 2. For each example, the text in bold shows where the specificity has been lost. The curly brackets after the MPO term in Table 5 point to the general type of term where the specificity was lost. In 6 examples, the anatomical specificity is insufficient and in 7 examples the specificity of the quality (e.g. tortuosity) has been lost. Although the MPO may not aim to describe phenotypes at a detailed cellular level, in an example such as “low molecular weight proteinuria” there is information lost after annotation with the MPO term ‘proteinuria’. The presence of “low molecular” in this example allows useful distinctions to be made regarding the underlying kidney filtration processes.

Many of the EuReGene phenotypes are biochemical measurements made in the blood and urine. The two main functions of a kidney are to filter the blood and excrete waste products in the form of urine. Thus the disruption of normal kidney function can be identified by examining the concentration of various substances, such as amino acid, in the blood and urine. The clinical term ‘aminoaciduria’ is commonly used to describe an increase in the concentration of amino acid present in the urine. Equivalent terms for calcium and protein concentrations are ‘hypercalciuria’ and ‘proteinuria’. Similar terms for describing ion concentrations in the blood are ‘hypercalcemia’ (high calcium) and ‘hypokalemia’ (low potassium). The MPO incorporates these and other similar clinical terms and thus minimal translation from free text is required. In contrast, using PaTO to annotate this category of phenotypes is slightly more cumbersome and is described below when considering PaTO annotation.

Having considered examples where the MPO description was less specific, it is now appropriate to consider examples where no MPO term was suitable (corresponding to the fourth category in Figure 2). In order to make general inferences which could apply to phenotypes outwith EuReGene, the reasons for non-annotation have been categorised as follows:

1. The MPO could not describe a normal phenotype referring to a particular process or anatomical part
   Example: “normal electrolyte levels in the blood”

2. The MPO could not describe the absence of a particular anatomical part or process
   Examples: “no apoptosis”, “no haemorrhage”

3. The granularity of the phenotype description meant that no MPO term was available.
   Examples: “SorLA protein upregulation”, “Renal Fanconi syndrome”, “cardiac conotruncal defects”

4. No reason could be established and additional terms should be added to the MPO

Currently, the MPO terms available for describing normal phenotypes are ‘normal phenotype’ and ‘no abnormal phenotype detected’. However, there is no MPO term to describe a normal phenotype with reference to a particular process or anatomical part such as ‘normal electrolyte levels in the blood’. As a result, MPO annotators use terms which are also used to specify the anatomy without any abnormality. For example, ‘muscle phenotype’ is used to annotate the free text description “no obvious muscle abnormalities”. A similar example is the free text description “at E18.5 there are no discernable gross abnormalities in the kidneys” which has been annotated with ‘renal/urinary system phenotype’. These examples may cause problems for any future reasoning as ‘muscle phenotype’ is being used to describe both normal phenotypes (as shown above) and abnormal phenotypes. The same is true for ‘renal/urinary system phenotype’ and probably other similar terms.

Because the MPO must add a term to describe the absence of every process or anatomical part, it is less likely to have a term linked to every process mentioned in GO or every anatomical part included in EMAP. Thus, the more limited extensibility of the MPO compared with PaTO means that the former is less likely to be able to describe absent entities.

Seven examples where the granularity was a problem were related to a protein. However, there were also two examples where the granularity of the free text description was too general to be described in the MPO. These were “Renal Fanconi syndrome” and “cardiac conotruncal defects”.

5.3 Suitability of PaTO

In comparison with the MPO, there are marginally more examples which can be annotated with PaTO (shown in Figure 3). Due to the compositional nature of the ontology, there were no examples where the ontological description exactly matched the free text description, as was the case with the MPO (in 14% of examples). However, a much higher proportion (63%) of the examples are synonymous with the free text descriptions, compared with 23% for the MPO.

- PaTO strengths

In order to identify the strengths of PaTO, it is useful to examine the phenotype descriptions which could only be annotated using PaTO. Table 6 contains examples which were annotated with PaTO but not the MPO. Examples where no appropriate entity term existed (although PaTO quality terms did) are also included in Table 6.

Absent phenotypes

Examples 1-3 in Table 6 describe the absence of either a process (apoptosis) or anatomical part (nephron, renal vesicle). These are easier to annotate using PaTO since ‘absent’ can be applied to any entity which is available in EMAP/GO or other external ontology. Example 4, which describes the absence of a haemorrhage, is slightly more difficult as an entity for haemorrhage could not be found.

General phenotypes

PaTO can describe the general dysfunction of any continuant entity (e.g. anatomical part) using the quality ‘decreased functionality’. ‘Functionality’ can be associated with a term from an external ontology such as EMAP. As the EMAP anatomy ontology tends to have a higher granularity than the anatomy terms in the MPO, “generalized proximal tubule dysfunction” (Example 5) can be described using PaTO but not the MPO. Generally, PaTO is more suited to describing phenotypes that are general abnormalities as it is not as constrained by the granularity of the anatomical/process description. The entities in these cases are only constrained by the terms available in other OBO ontologies rather than the coverage of the MPO. Since these other ontologies tend to contain more specific anatomical parts (for example EMAP) and processes (GO), a higher specificity of phenotype description is achieved.

Process phenotypes

Examples 6-7 show an advantage of the compositional approach by allowing the description of a process (endocytosis) within a specific anatomical part (renal proximal tubule). PaTO is able to describe many different features of processes (impaired and abolished in these examples) whilst also describing where they occur. Consistent with the earlier examples, provided external ontologies can match the desired specificity, PaTO is more flexible and able to cope with many different processes in combination with various permutations of how and where they are affected.

Other examples

Unlike the earlier examples which demonstrate the advantages of the compositional nature of PaTO, Examples 8-13 do not reflect a particular design advantage of one ontology over the other. Instead, they reflect cases where appropriate terms were available in PaTO but not in the MPO, for example ‘drug_response’ which is used to describe Example 9.

- PaTO weaknesses

Table 7 shows examples where the PaTO description was less specific than the free text description. These examples correspond with the second category shown in Figure 3. In several examples, supplementing PaTO with additional terms would remove this information loss. In the first four examples, the specificity of the quality is lost; for example, the quality term ‘tortuosity’ should be added to describe “increase in vessel tortuosity”. Where PaTO could not be used to describe a phenotype, the reasons were categorised as shown below. Examples are given for each category, except where there was no obvious reason and thus no commonality between the descriptions.

1. PaTO could not annotate because the phenotype was a clinical description.
   Examples: “holoprosencephaly”, “renal tubular acidosis”, “cardiac conotruncal defects”

2. PaTO could not annotate because the phenotype described protein or mRNA expression.
   Examples: “overexpression of Cd2ap and Nphs2 mRNA”, “normal angiotensinogen expression” (also an example of category 3 below), “normal renin expression” (also an example of category 3 below)

3. PaTO could not represent a normal phenotype related to a specific entity.
   Example: “normal urine concentrating ability”

4. No obvious reason

It is often significant if a normal phenotypic result appears on a mutant background since it may show dependence of the phenotype on environmental factors or development stage. Alternatively, in a double mutant the effects of a second gene may compensate for the actions of the first, producing an apparently normal phenotype. A normal phenotype may also signify that the mutant allele is not involved in the biological process or function which is being studied. Several normal phenotypes exist in the EuReGene data set, for example “normal urine concentrating ability” and “no abnormal electrolyte concentrations in blood”. PaTO was able to describe a higher proportion of normal phenotypes than the MPO. There were 6 examples where the MPO could not describe the normal phenotype and 4 examples where PaTO could not. PaTO is able to describe normal phenotypes using the term ‘normal’ which can be applied to any entity, whereas we have seen that the MPO does not contain any terms for relating a normal phenotype to a specific entity.

Although PaTO can describe any process, cell or anatomical part using ‘abnormal’, there remains a difficulty with representing many normal qualities. In the EuReGene data set, normal concentration phenotypes illustrated the difficulty. However, this could equally apply to qualities such as ‘amount’, ‘temperature’, ‘size’; indeed any other quality which can be increased/decreased (currently there are 72 of these) can also be normal. Thus, at least 72 terms could be added to the PaTO ontology which would allow an increased range of normal phenotypes to be expressed. A similar situation arises with abnormal phenotypes where the increase or decrease in a quality may be unspecified.

6 Discussion

6.1 EMAP Expressivity

EMAP appeared to be capable of annotating the majority of EuReGene phenotypes. Protein expression phenotypes and those where a general type of entity was included (e.g. bone) were the only problematic descriptions. The only anatomical phenotype term unavailable was ‘hilar artery’ which should be added to the EMAP ontology.

6.2 Expressivity using a compositional framework

PaTO allows the flexibility of combining any anatomical, process or cell entity with any phenotypic quality, which resulted in a higher proportion of annotation within the EuReGene data set. However, with PaTO annotation there is a danger that the power of combining multiple ontologies will be lost if the entity essentially describes the phenotype and the quality is ‘abnormal’.

PaTO offers flexibility by combining terms from multiple ontologies. There were 39 examples in the EuReGene dataset where two entities were required to complete the full phenotype description using PaTO qualities. 29 were phenotypes describing the concentration of an entity, 8 described processes (3 absorption, 4 endocytosis, 1 apoptosis), 1 related to a drug response and the final one was a general cell type.

Many phenotypes involve the alteration of a process within a specific anatomical context. For example, “impaired proximal tubular endocytosis” describes the impairment of the process ‘endocytosis’ but also specifies that this occurs in the ‘proximal tubule’. PaTO is well designed for describing the temporal features of a process using terms such as ‘arrested’ or ‘delayed’. However, there is an additional requirement for a mechanism to describe where the process is affected. In practice, annotators using PaTO often use several ‘entity’ terms from various ontologies to complete the full phenotype description. In the example above we could use the GO process term ‘endocytosis’ supplemented with the anatomy term ‘proximal tubule’ which can then be used with a PaTO quality (impaired) to form the full phenotype description. This process has been termed “post-composition”. However, post-composition in this context could be fairly arbitrary. Descriptions are post-composed at the point of annotation and term associations are not necessarily approved by the ontology developers at this point. This differs from the MPO where, although terms may be more constrained, they are fixed in the ontology. By allowing the flexibility of post-composing descriptions ad hoc, the rigour of the ontology may be compromised. Therefore, there should be a more formal structure for post-composition to prevent many synonymous phenotypes being annotated with slightly different combinations of ontology terms.

6.3 Usability

The MPO and PaTO are based on two distinct approaches to phenotype description. The evolution of MPO has been driven by the phenotype descriptions appearing in journal articles, resulting in many MPO terms closely resembling free text descriptions. From an annotation perspective, it is advantageous to use ontology terms which mirror descriptions already established in the field such as ‘renal tubular acidosis’. In the majority of examples, it was simpler to translate the descriptions provided by EuReGene researchers to MPO terms. This became apparent when discussing appropriate ontology terms with EuReGene partners and is confirmed by the higher proportion of free text descriptions which are exactly the same as the MPO term, shown in Figure 2. However, this does not take into consideration the possible downstream applications of the phenotype ontology. It may be appropriate to use MPO for annotation and then link this to an underlying PaTO description which can be used for reasoning. This would rely on an accurate translation from the MPO to PaTO. Efforts towards formalising this translation have begun within the National Centre for Biomedical Ontology [14].

It is important to enable collaboration between biology researchers and phenotype annotators. Developing an expertly curated phenotype data set relies on the insight of researchers who have in-depth knowledge of each phenotype. To ensure consistent use of ontology terms, an independent researcher with knowledge of phenotype ontologies is ideal. Other possibilities for ensuring consistency would be to develop detailed, publicly available guidelines on annotation and/or an online submission system which suggests previously curated ontology terms.

Acknowledgements

EuReGene (www.euregene.org) is an Integrated Project funded by the EC as part of the Framework program 6 (FP6 005085).

7 References

[1] The European Mouse Mutagenesis Consortium: The European dimension for the mouse genome mutagenesis program. Nature Genetics 2004, 36(9): 925-7.
[2] The International Mouse Knockout Consortium: A Mouse for all reasons. Cell 2007, 128: 9-13.
[3] Open Biomedical Ontologies (OBO) [http://www.obofoundry.org/]
[4] Smith CL, Goldsmith C-AW, Eppig JT: The Mammalian Phenotype Ontology as a tool for annotating, analyzing and comparing phenotypic information.
Genome Biology 2005, 6(1): R7
[5] PaTO (Phenotype and Trait Ontology) [http://www.obofoundry.org/cgi-bin/detail.cgi?id=quality]
[6] OWL Web Ontology Use Cases and Requirements (2004) [http://www.w3.org/TR/webont-req]
[7] Dolan ME, Ni L, Camon E, Blake JA: A procedure for assessing GO annotation consistency. Bioinformatics 2005, 21(Suppl 1): i136-43
[8] Willnow T et al.: The European Renal Genome Project: An Integrated Approach Towards Understanding the Genetics of Kidney Development and Disease. Organogenesis 2005, 2(2): 42-47
[9] Christiansen JH, Yang Y, Venkataraman S, Richardson L, Stevenson P, Burton N, Baldock RA, Davidson DR: EMAGE: A spatial database of gene expression patterns during mouse embryo development. Nucleic Acids Res. 2006, 34: D637-41
[10] Baldock RA, Bard JB, Burger A, Burton N, Christiansen J, Feng G, Hill B, Houghton D, Kaufman M, Rao J, Sharpe J, Ross A, Stevenson P, Venkataraman S, Waterhouse A, Yang Y, Davidson DR: EMAP and EMAGE: a framework for understanding spatially organized data. Neuroinformatics 2003, 1(4): 309-25
[11] Harris MA et al., Gene Ontology Consortium: The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res. 2004, 32 (Database issue): D258-261
[12] Bard J et al.: An Ontology for Cell Types. Genome Biology 2005, 6: R21
[13] Chemicals of Biological Interest (ChEBI) [www.ebi.ac.uk/chebi/]
[14] National Centre for Biomedical Ontology [http://www.bioontology.org/]

Tables

Table 1.
EuReGene phenotype description headings with examples

PHENOTYPE DESCRIPTION SHEET HEADING | EXAMPLE
Gene targeted | Clcn5
Type of genetic manipulation | Knock out
Spatial and structural information | Proximal tubule (subcellular localization to endosomes)
Major phenotypes, in relation with structural defects | Low molecular weight proteinuria; Generalized aminoaciduria; Glycosuria; Hypercalciuria - renal calcium deposits; Increased bone turnover; Impaired proximal tubular endocytosis
Assays used to determine the phenotypes | Standard chemistry, immunohistochemistry
Biological data available on individual or pool basis? | Pool basis
Publications using the mouse | Pubmed ID : 11115837

Table 2. The number of terms annotated to EuReGene phenotypes for each ontology. GO, cell type and ChEBI ontologies were used in conjunction with PaTO terms where an anatomical term was not appropriate. The number of phenotype descriptions which could not be annotated is shown in brackets. Since the final three ontologies were only used in conjunction with PaTO qualities, numbers in brackets are not included for these.

ONTOLOGY | NUMBER OF DISTINCT TERMS USED TO ANNOTATE PHENOTYPES
Edinburgh Mouse Atlas Project (EMAP) Ontology | 31 (8)
Mammalian Phenotype Ontology (MPO) | 53 (32)
Phenotype and Trait Ontology (PaTO) | 32 (23)
Gene Ontology (GO) | 6
Chemical Entities of Biological Interest (ChEBI) | 2
Cell type Ontology | 1

Table 3. Examples of free text phenotype descriptions where no EMAP term was available.
FREE TEXT PHENOTYPE DESCRIPTION | REASON WHY EMAP WAS UNSUITABLE
bone calcification defects | Anatomical term was a general type
increased bone resistance | Anatomical term was a general type
stromal cell defect (immediately adjacent to renal pelvis region) | Anatomical term was a general type
normal angiotensin 1 | Protein level could not be described
normal angiotensinogen expression | Protein level could not be described
normal renin expression | Protein level could not be described
enhanced processing of amyloid precursor protein (APP) to amyloid in neurons in the brain | Protein level could not be described
hilar artery calcification | Missing term – unknown reason

Table 4. EuReGene phenotype descriptions (free text) which were annotated by only the MPO. Exact match and synonymous MPO annotations are included (first 2 categories shown in Figure 2). Definitions are included for each MPO term.

FREE TEXT PHENOTYPE DESCRIPTION | MPO TERM | MPO TERM DEFINITION
distal renal tubular acidosis | renal tubular acidosis | a clinical syndrome characterized by the inability to acidify urine
holoprosencephaly | holoprosencephaly | presence of a single forebrain hemisphere or lobe; often accompanied by a deficit in median facial development
polyhydramnios | polyhydramnios | abnormally high amniotic fluid volume; may result from maternal diabetes, chromosomal abnormalities or other congenital abnormalities
high lethality | premature death/postnatal lethality | death after weaning age, but before the normal life span/premature death anytime after postnatal day 1 to weaning age
concentration defect | abnormal urine osmolality | changes in the concentration of ions in the urine compared to the normal state

Table 5. Examples where the MPO term was less specific than the free text description. The curly brackets point to the category of term which could not be expressed with sufficient specificity.
| Free text description | MPO term |
|---|---|
| increase in vessel number | abnormal vasculature |
| increase in vessel tortuosity | abnormal vasculature |
| no obvious histological lesions noted (glomerulus) | no abnormal phenotype detected |
| renal iron deposits | abnormal kidney iron level |
| tubular atrophy | renal tubular necrosis |
| podocyte hypertrophy | abnormal podocyte |
| podocyte vacuolization | abnormal podocyte |
| altered vascular activity/abnormal calcium signalling | abnormal vascular endothelial cell physiology |
| bone calcification defects | abnormal bone mineralization |
| rare crescent formation | abnormal renal glomerulus morphology |
| peritubular capillary regression | abnormal vascular regression |
| arcuate artery calcification | arterial calcification |
| focal and segmental glomerulosclerosis | glomerulosclerosis |
| calcification in the renal papilla | kidney calcification |
| increased plasma vitamin D3 | abnormal vitamin level |
| urinary excretion of vitamin A bound to retinal binding protein | abnormal retinol metabolism |
| interstitial haemorrhage | kidney hemorrhage |
| actin-expressing smooth muscle cells fail to differentiate | abnormal cell differentiation |
| increased mesangial matrix | abnormal mesangial cell |
| low molecular weight proteinuria | proteinuria |

Categories marked by the curly brackets, in order down the table: quality, process, anatomical, vitamin, cellular, protein.

Table 6. EuReGene phenotype descriptions (free text) which were annotated by PaTO only. (Corresponds with first category shown in Figure 3).

| No. | Example free text descriptions where only PaTO could annotate |
|---|---|
| 1 | nephrons fail to develop/lack of nephrons |
| 2 | no apoptosis |
| 3 | renal vesicles do not form |
| 4 | no haemorrhage |
| 5 | generalized proximal tubule dysfunction (Renal Fanconi syndrome) |
| 6 | loss of endocytic activity in the renal proximal tubules * |
| 7 | impaired proximal tubular endocytosis * |
| 8 | stromal cell defect (immediately adjacent to renal pelvis region) |
| 9 | impaired response to loop and thiazide diuretics |
| 10 | vascular fragility |
| 11 | decreased lithium clearance |
| 12 | increased chloride excretion * |
| 13 | increased aldosteronuria |

* denotes that 2 examples of this free text description were present in the EuReGene data set.

Table 7. Examples of phenotype descriptions where using PaTO resulted in a loss of specificity. The bold terms indicate where the specificity has been lost.

| Example no. | Free text phenotype description | PaTO description: Entity | PaTO description: PaTO parent | PaTO description: PaTO child |
|---|---|---|---|---|
| 1 | increase in vessel tortuosity | renal cortical arterial system | structure | disorganized |
| 2 | peritubular capillary regression* | renal cortical capillary | relative quantity | decreased |
| 3 | no obvious histologic lesions noted (glomerulus) | glomerular tuft | deviation (from normal) | normal |
| 4 | perinatal/postnatal lethality* | mouse | viability | dead |
| 5 | focal and segmental glomerulosclerosis** | glomerular tuft | structure | collapsed |
| 6 | hydrops fetalis | embryo | structure | edematous |
| 7 | microvillus formation | visceral epithelium | structure | degenerate |
| 8 | rare crescent formation | Bowman's capsule | structure | hyperplastic |
| 9 | foot process effacement* | visceral epithelium | deviation (from normal) | abnormal |
| 10 | mesangiolysis** | glomerular mesangium | structure | degenerate |

* denotes that 2 examples of this free text description were present in the EuReGene data set.
** denotes that 3 examples of this free text description were present in the EuReGene data set.

Figures

Figure 1. Frequency of EMAP terms in the EuReGene phenotype data set. Terms are listed in ascending order of frequency.
[Bar chart. EMAP terms shown, in ascending order of frequency: arcuate artery; Bowman's capsule; cardiovascular system; forebrain; interlobular artery; mature nephron; pancreas; pelvic smooth muscle; renal cortical arterial system; renal vesicle; embryo; inner medullary collecting duct; renal cortical vasculature; renal interstitium group; small blood vessels; distal convoluted tubule; mouse; renal cortical capillary; renal distal tubule; medullary collecting duct; cortical collecting duct; glomerular mesangium; thick ascending limb; glomerular tuft; visceral epithelium; cortical renal tubule; renal proximal tubule - S1, S2; urine; renal proximal tubule; blood; metanephros.]

Figure 2. Frequency of MPO annotation for 121 EuReGene phenotypes (originally described using free text). If a free text phenotype description is repeated (e.g. by different labs), both are counted individually.

[Bar chart. Y-axis: number of free text descriptions (0 to 50). X-axis categories: Exact match MPO term; Synonymous MPO term; Less specific MPO term; No appropriate MPO term. Bar labels appearing in the figure: 44, 32, 28 and 17; which label belongs to which category cannot be determined from the extracted text.]

Key
Exact match MPO term: the annotated MPO term was exactly the same as the free text phenotype description
Synonymous MPO term: the annotated MPO term was synonymous with the free text phenotype description
Less specific MPO term: the annotated MPO term was less specific than the free text description
No appropriate MPO term: there was no appropriate MPO term to describe the phenotype

Figure 3. Frequency of PaTO annotation for 121 EuReGene phenotypes (originally described as free text). If a free text phenotype description is repeated (e.g. by different labs), both examples are counted individually.
[Bar chart. Y-axis: number of free text descriptions (0 to 80). X-axis categories: Synonymous PaTO term; Less specific PaTO quality; No appropriate PaTO quality; No appropriate entity. Bar labels appearing in the figure: 76, 21, 17 and 7; which label belongs to which category cannot be determined from the extracted text.]

Key
Synonymous PaTO term: the annotated PaTO term was synonymous with the free text phenotype description
Less specific PaTO term: the annotated PaTO term was less specific than the free text description
No appropriate PaTO term: there was no appropriate PaTO term to describe the phenotype
No appropriate entity: there was no appropriate term for describing the entity part of the PaTO description, for example if there was no appropriate GO or cell type term
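The entity plus quality structure of the PaTO descriptions in Table 7 can be illustrated with a small data structure. The following Python sketch is only an illustration and is not the representation used in the EuReGene case study: the class name `EQAnnotation`, its field names and the `render` format are hypothetical, and the three rows are copied from Table 7.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EQAnnotation:
    """One PaTO-style description: an entity term plus a quality.

    Hypothetical structure for illustration; names are assumptions.
    """
    free_text: str       # original free text phenotype description
    entity: str          # entity term (e.g. an EMAP anatomical term)
    quality_parent: str  # PaTO parent quality term
    quality_child: str   # PaTO child quality term

    def render(self) -> str:
        # Assumed display format: "entity: parent = child"
        return f"{self.entity}: {self.quality_parent} = {self.quality_child}"


# Three rows copied from Table 7.
examples = [
    EQAnnotation("increase in vessel tortuosity",
                 "renal cortical arterial system", "structure", "disorganized"),
    EQAnnotation("hydrops fetalis", "embryo", "structure", "edematous"),
    EQAnnotation("rare crescent formation",
                 "Bowman's capsule", "structure", "hyperplastic"),
]

for ann in examples:
    print(f"{ann.free_text!r} -> {ann.render()}")
```

Written this way, the loss of specificity described in Table 7 is easy to see: the free text "increase in vessel tortuosity" names a specific change, whereas the structured form can only record that the structure of the renal cortical arterial system is disorganized.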