Use of text mining for Experimental Factor Ontology coverage expansion in the scope of target validation

Şenay Kafkas, Ian Dunham, Helen Parkinson and Jo McEntyre
European Bioinformatics Institute – European Molecular Biology Laboratory (EMBL-EBI), and Open Targets
Wellcome Genome Campus, Hinxton, CB10 1SD, UK

Abstract—Understanding the molecular biology and development of disease plays a key role in drug development. Integrating evidence from different experimental approaches with data available from public resources (such as gene expression level changes and reaction pathways affected by pathogenic mutations) can be a powerful approach for evaluating different aspects of target-disease associations. The application of ontologies is of fundamental importance to effective integration. The Target Validation Platform is a user-friendly interface that integrates such evidence from various resources with the aim of assisting scientists to identify and prioritise drug targets. Currently, the EFO is used as the reference ontology for diseases in the platform, importing terms from existing disease ontologies such as the Human Phenotype Ontology as required. In order to generalize the use of EFO from key target-diseases for wider use, we need to compare the target-associated disease coverage in EFO with the scope of other available disease terminology resources. In this study, we address this issue by using text mining and present our initial results.

Keywords—text mining; ontology; integration; target validation

I. INTRODUCTION

Integrating data from de novo experiments with data available in public data resources in a user-friendly interface to support decision making has been the goal of the Target Validation Platform (https://targetvalidation.org). This platform integrates a variety of evidence for a given target (gene/protein)-disease association, such as reaction pathways that are affected by pathogenic mutations from Reactome [1], and text-mined target-disease associations from the Europe PubMed Central (Europe PMC) (http://europepmc.org/) literature database [2]. The application of disease ontologies is critical to integrating such different data types.

The Experimental Factor Ontology (EFO) (http://www.ebi.ac.uk/efo/) is the reference resource for diseases in the platform (“disease” here encompasses both disease and phenotype, as the disease/phenotype boundary is blurred both in the platform’s data sources and ontologically). Therefore, it is important to understand the disease coverage of EFO in the scope of target validation, in comparison to the other available major disease and phenotype resources, in order to expand its disease coverage. In this study, we address this issue by using text mining, which is a widely used approach in ontology expansion [3] and target-disease association identification [4], to compare terms available in existing ontologies, and we present our initial results.

II. METHODS

A. Resources Used

We used Europe PMC as the literature database, UniProt for target (gene/protein) names and six major disease terminologies: EFO (V2.69), the Human Phenotype Ontology (HP) (access date: 31-03-2016) (http://human-phenotype-ontology.github.io/), Orphanet Rare Disease Ontology (ORDO) (V2.1) (http://www.orphadata.org/cgi-bin/inc/ordo_orphanet.inc.php), the Human Disease Ontology (HDO) (06-01-2016 update) (http://disease-ontology.org/), the Mammalian Phenotype Ontology (MP) (access date: 31-03-2016) (http://www.informatics.jax.org/searches/MP_form.shtml), and Unified Medical Language Systems (UMLS) (2014 AB Release) (https://www.nlm.nih.gov/research/umls/).

Europe PMC is one of the largest biomedical literature databases in the world, providing public access to 31 million abstracts and 3.7 million full text articles and covering both PubMed and PubMed Central. In our analyses, we used the latest archived version of the Open Access full text articles (~1 million) (http://europepmc.org/ftp/archive/v.2016.03/) from the database.

We generated and refined dictionaries from the human part of the SwissProt Database (the expert-annotated part of UniProt) (http://www.uniprot.org/) and the disease and phenotype parts of EFO, HP, ORDO, MP, HDO and UMLS before applying text mining. In the refining process, we filtered out the terms that would potentially introduce high numbers of false positives. These are the terms having character length < 3 and the terms that are ambiguous with common English words (e.g. “Large” is a protein name as well). In addition, we generated term variations by replacing the widely used Greek letters in gene/disease names with their symbols (e.g. replacing “alpha” with α). The final target and disease dictionaries consisted of a total of 104,434 UniProt, 26,617 EFO, 18,332 HP, 20,152 ORDO, 29,800 MP, 21,789 HDO and 75,060 UMLS terms.

B. Target and disease name identification

We used the Europe PMC text-mining pipeline, which is based on Whatizit [5], to annotate target and disease names in text with the dictionaries described above. Target and disease name abbreviations can be ambiguous with other names (e.g. ALS, which stands for “Amyotrophic Lateral Sclerosis”, is ambiguous with “Advanced Life Support”, PMID:26811420). Therefore, we implemented and used abbreviation filters to screen out the potential false positive disease/protein abbreviations introduced during the annotation process. The abbreviation filters operate based on several heuristic rules. For example, a text sequence within parentheses (i.e. (XYZ)), appearing in uppercase and having length < 6, is identified as a name abbreviation candidate and is retained as an annotation only if any of its long forms from the given disease ontology exists elsewhere in the document.

C. Target-disease association extraction

The associations are extracted by identifying target-disease co-occurrences at the sentence level and applying several filtering rules to reduce the noise possibly introduced by the high-sensitivity, low-specificity co-occurrence approach. The filtering rules utilise heuristic information from a careful manual analysis of the text. They include filtering out all articles other than “Research” articles (e.g. Reviews and Case Reports are excluded), filtering out target-disease associations appearing in certain sections such as “Methods” and “References”, and filtering out target-disease associations that appear only once in the body of a given article but not in the article's title or abstract (see [6] for the details).

III. RESULTS AND DISCUSSION

Our target-disease extraction system achieves a Mean Average Precision value of 81% [6]. Figure 1 presents a Venn diagram showing the disease terms found in the corpus that are associated with targets, after application of the target-disease heuristics above, for each of the six different disease resources. There are 3,859 HDO, 3,384 MP, 1,610 ORDO, 4,277 HP and 17,584 UMLS target-associated distinct disease terms that are not covered by EFO. Possible reasons for the difference in coverage between EFO and the other terminologies are the absence of a given disease name from EFO, the coverage of a given disease under different synonyms, and a different classification of a given term in EFO. For example, “fetal valproate syndrome” and “Chagas cardiomyopathy” from ORDO are not covered by EFO. “HIV” is classified as “disease and syndrome” in UMLS, indicating “HIV infection”; in EFO, however, it is classified as a virus name. The results suggest that there is some room for improvement in the EFO, and this will be explored for future releases of EFO.

IV. CONCLUSION AND FUTURE WORK

In this study, we demonstrate the use of text mining for analysing and suggesting approaches to expand the disease/phenotype coverage of EFO within the scope of target validation. We focused on the target-associated disease terms from EFO and five other major disease resources, but there is no reason why this approach could not be applied to other contexts in efforts to integrate across terminologies and ontologies. In future, we will extend our analysis to discover any trends over the resources, and to understand the disease/phenotype target space derived from the literature and how many of the associations that we find within the scope of EFO are relevant.

ACKNOWLEDGMENTS

This work is funded by Open Targets.

REFERENCES

[1] A. Fabregat, K. Sidiropoulos, P. Garapati, M. Gillespie, K. Hausmann et al., “The Reactome pathway Knowledgebase,” Nucleic Acids Res., 44(D1):D481-7, 2016.
[2] Europe PMC Consortium, “Europe PMC: a full-text literature database for the life sciences and platform for innovation,” Nucleic Acids Res., 43(Database issue):D1042-8, 2015.
[3] I. Spasic, S. Ananiadou, J. McNaught, A. Kumar, “Text Mining and Ontologies in biomedicine: Making sense of raw text,” Briefings in Bioinformatics, 6(3), pp. 239-251, 2005.
[4] S. Pletscher-Frankild, A. Pallejà, K. Tsafou, J.X. Binder, L.J. Jensen, “DISEASES: Text mining and data integration of disease–gene associations,” Methods, 34, pp. 83-89, 2015.
[5] D. Rebholz-Schuhmann, M. Arregui, S. Gaudan, H. Kirsch, A. Jimeno, “Text processing through Web services: calling Whatizit,” Bioinformatics, vol. 24(2), pp. 296-8, 2008.
[6] Ş. Kafkas, I. Dunham and J. McEntyre, “Literature Evidence in Open Targets – a target validation platform,” Phenotype Day @ISMB 2016, special session of the Bio-Ontologies SIG, 8-12 July 2016, Orlando, Florida, U.S.

Fig. 1. Venn diagram showing overlapping target-associated disease terms
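
As an illustration of the heuristics described in Section II, the following Python sketch shows one way the dictionary refinement, the parenthesised-abbreviation filter and the association filtering rules could be expressed. It is not the Europe PMC pipeline itself: the function names, the word list, the Greek letter subset, the section labels and the input data structures are all assumptions made for this example, and the real system applies further rules (see [6]).

```python
import re

# Assumption: tiny stand-ins; the paper does not name the common-English
# word list and gives only "alpha" as a Greek letter example.
COMMON_ENGLISH = {"large", "was", "for", "not"}
GREEK = {"alpha": "α", "beta": "β", "gamma": "γ"}

# Assumption: hypothetical section labels for the parsed full text.
EXCLUDED_SECTIONS = {"Methods", "References"}

# Uppercase sequence in parentheses with length < 6, e.g. "(ALS)".
PAREN_ABBREV = re.compile(r"\(([A-Z]{1,5})\)")


def refine_dictionary(terms):
    """Drop terms shorter than 3 characters or ambiguous with common English
    words, then add Greek-symbol variants of the remaining terms."""
    refined = set()
    for term in terms:
        if len(term) < 3 or term.lower() in COMMON_ENGLISH:
            continue
        refined.add(term)
        variant = term
        for name, symbol in GREEK.items():
            # Whole-word replacement only, so "TNF-alpha" -> "TNF-α".
            variant = re.sub(rf"\b{name}\b", symbol, variant, flags=re.IGNORECASE)
        if variant != term:
            refined.add(variant)
    return refined


def filter_abbreviations(document, long_forms):
    """Return parenthesised abbreviation candidates that are retained, i.e.
    those with at least one ontology long form present in the document.

    long_forms maps an abbreviation to its long forms (hypothetical input)."""
    text = document.lower()
    kept = []
    for match in PAREN_ABBREV.finditer(document):
        abbr = match.group(1)
        if any(form.lower() in text for form in long_forms.get(abbr, [])):
            kept.append(abbr)
    return kept


def keep_association(article_type, sections):
    """Apply the filtering rules to one target-disease pair.

    sections lists the section label of every sentence in which the pair
    co-occurs within a single article (hypothetical input)."""
    if article_type != "Research":
        return False
    usable = [s for s in sections if s not in EXCLUDED_SECTIONS]
    in_title_or_abstract = any(s in ("Title", "Abstract") for s in usable)
    # A pair seen only once in the body, and not in title/abstract, is dropped.
    return in_title_or_abstract or len(usable) > 1
```

For instance, with the long form “amyotrophic lateral sclerosis” registered for ALS, a document that spells out the disease name keeps the “(ALS)” annotation, whereas a document that only mentions “advanced life support (ALS)” does not.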