Use of text mining for Experimental Factor Ontology coverage expansion in the scope of target validation
Şenay Kafkas, Ian Dunham, Helen Parkinson and Jo McEntyre
European Bioinformatics Institute – European Molecular Biology Laboratory (EMBL-EBI), and Open Targets
Wellcome Genome Campus
Hinxton, CB10 1SD, UK


Abstract—Understanding the molecular biology and development of disease plays a key role in drug development. Integrating evidence from different experimental approaches with data available from public resources (such as gene expression level changes and reaction pathways affected by pathogenic mutations) can be a powerful approach for evaluating different aspects of target-disease associations. The application of ontologies is of fundamental importance to effective integration. The Target Validation Platform is a user-friendly interface that integrates such evidence from various resources with the aim of assisting scientists to identify and prioritise drug targets. Currently, the Experimental Factor Ontology (EFO) is used as the reference ontology for diseases in the platform, importing terms from existing disease ontologies such as the Human Phenotype Ontology as required. In order to generalize the use of EFO from key target-diseases for wider use, we need to compare the target-associated disease coverage in EFO with the scope of other available disease terminology resources. In this study, we address this issue by using text mining and present our initial results.

Keywords—text mining; ontology; integration; target validation

I. INTRODUCTION

Integrating data from de novo experiments with data available in public data resources in a user-friendly interface to support decision making has been the goal of the Target Validation Platform (https://targetvalidation.org). This platform integrates a variety of evidence for a given target (gene/protein)-disease association, such as reaction pathways that are affected by pathogenic mutations from Reactome [1], and text-mined target-disease associations from the Europe PubMed Central (Europe PMC) (http://europepmc.org/) literature database [2]. The application of disease ontologies is critical to integrate such different data types.

The Experimental Factor Ontology (EFO) (http://www.ebi.ac.uk/efo/) is the reference resource for diseases in the platform (“disease” here encompasses both disease and phenotype, as the disease/phenotype boundary is blurred both in the platform’s data sources and ontologically). Therefore, it is important to understand the disease coverage of EFO in the scope of target validation, in comparison to the other available major disease and phenotype resources, in order to expand its disease coverage. In this study, we address this issue by using text mining, which is a widely used approach in ontology expansion [3] and target-disease association identification [4], to compare terms available in existing ontologies and present our initial results.

II. METHODS

A. Resources Used

We used Europe PMC as the literature database, UniProt for target (gene/protein) names and six major disease terminologies: EFO (V2.69), the Human Phenotype Ontology (HP) (access date: 31-03-2016) (http://human-phenotype-ontology.github.io/), Orphanet Rare Disease Ontology (ORDO) (V2.1) (http://www.orphadata.org/cgi-bin/inc/ordo_orphanet.inc.php), the Human Disease Ontology (HDO) (06-01-2016 update) (http://disease-ontology.org/), the Mammalian Phenotype Ontology (MP) (access date: 31-03-2016) (http://www.informatics.jax.org/searches/MP_form.shtml), and the Unified Medical Language System (UMLS) (2014 AB Release) (https://www.nlm.nih.gov/research/umls/).

Europe PMC is one of the largest biomedical literature databases in the world. It provides public access to 31 million abstracts and 3.7 million full-text articles, covering both PubMed and PubMed Central. In our analyses, we used the latest archived version of the Open Access full-text articles (~1 million) (http://europepmc.org/ftp/archive/v.2016.03/) from the database.

We generated and refined dictionaries from the human part of the SwissProt database (the expert-annotated part of UniProt) (http://www.uniprot.org/) and the disease and phenotype parts of EFO, HP, ORDO, MP, HDO and UMLS before applying text mining. In the refining process, we filtered out the terms that would potentially introduce high numbers of false positives: terms shorter than 3 characters and terms that are ambiguous with common English words (e.g. “Large” is also a protein name). In addition, we generated term variations by replacing the spelled-out Greek letter names widely used in gene/disease names with their symbols (e.g. replacing “alpha” with “α”). The final target and disease dictionaries consisted of a total of 104,434 UniProt, 26,617 EFO, 18,332 HP, 20,152 ORDO, 29,800 MP, 21,789 HDO and 75,060 UMLS terms.
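The dictionary refinement described above (length filter, common-word filter and Greek-letter variants) can be illustrated with a minimal Python sketch. It is not the code used in this study: the function names, the stop list of common English words and the partial Greek-letter mapping are assumptions introduced for illustration only.

```python
import re

# Hypothetical stop list: the actual list of common English words is not given in the text.
COMMON_ENGLISH_WORDS = {"large", "and", "was", "for", "not"}

# Assumed, partial mapping from spelled-out Greek letter names to their symbols.
GREEK_SYMBOLS = {"alpha": "α", "beta": "β", "gamma": "γ", "delta": "δ", "kappa": "κ"}


def refine_terms(terms, min_length=3, stop_words=COMMON_ENGLISH_WORDS):
    """Drop terms shorter than min_length or ambiguous with common English words."""
    kept = []
    for term in terms:
        cleaned = term.strip()
        if len(cleaned) < min_length or cleaned.lower() in stop_words:
            continue
        kept.append(cleaned)
    return kept


def greek_variants(term):
    """Return the term, plus a variant with Greek letter names replaced by symbols."""
    variant = term
    for name, symbol in GREEK_SYMBOLS.items():
        # Word boundaries keep "alphabet" intact while still matching "TNF-alpha".
        variant = re.sub(rf"\b{name}\b", symbol, variant, flags=re.IGNORECASE)
    return [term] if variant == term else [term, variant]


def build_dictionary(terms):
    """Refine the raw terms, add Greek-symbol variants, and de-duplicate in order."""
    seen, dictionary = set(), []
    for term in refine_terms(terms):
        for form in greek_variants(term):
            if form not in seen:
                seen.add(form)
                dictionary.append(form)
    return dictionary
```

For instance, `build_dictionary(["Large", "AR", "TNF-alpha", "BRCA1"])` drops “Large” (common English word) and “AR” (shorter than 3 characters) and returns `["TNF-alpha", "TNF-α", "BRCA1"]`.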
B. Target and disease name identification

We used the Europe PMC text-mining pipeline, which is based on Whatizit [5], to annotate target and disease names in text with the dictionaries described above. Target and disease name abbreviations can be ambiguous with other names (e.g. ALS, which stands for “Amyotrophic Lateral Sclerosis”, is ambiguous with “Advanced Life Support”, PMID:26811420). Therefore, we implemented abbreviation filters to screen out the potential false positive disease/protein abbreviations introduced during the annotation process. The abbreviation filters operate on several heuristic rules. For example, a text sequence within parentheses (i.e. (XYZ)) that appears in uppercase and has length < 6 is identified as a name abbreviation candidate, and is retained as an annotation only if any of its long forms from the given disease ontology exists elsewhere in the document.

C. Target-disease association extraction

The associations are extracted by identifying the target-disease co-occurrences at the sentence level and applying several filtering rules to reduce the noise possibly introduced by the high-sensitivity, low-specificity co-occurrence approach. The filtering rules utilise heuristic information from a careful manual analysis of the text. They include filtering out all articles other than “Research” articles (e.g. Reviews, Case Reports), filtering out target-disease associations appearing in certain sections such as “Methods” and “References”, and filtering out target-disease associations that appear only once in the body of a given article but not in the article's title or abstract (see [6] for the details).

III. RESULTS AND DISCUSSION

Our target-disease extraction system achieves a Mean Average Precision value of 81% [6]. Figure 1 presents a Venn diagram showing the disease terms found in the corpus that are associated with targets, after application of the target-disease heuristics above, for each of the six different disease resources. There are 3,859 HDO, 3,384 MP, 1,610 ORDO, 4,277 HP and 17,584 UMLS target-associated distinct disease terms that are not found by EFO. Possible reasons for the difference in coverage between EFO and the other terminologies are twofold: nonexistence of a given disease name in EFO, and the coverage of a given disease with different synonyms and a different classification of a given term in EFO. For example, “fetal valproate syndrome” and “Chagas cardiomyopathy” from ORDO are not covered by EFO. “HIV” is classified as “disease and syndrome” in UMLS, indicating “HIV infection”; however, in EFO, it is classified as a virus name. The results suggest that there is some room for improvement in the EFO, and this will be explored for future releases of EFO.

Fig. 1. Venn diagram showing overlapping target-associated disease terms

IV. CONCLUSION AND FUTURE WORK

In this study, we demonstrate the use of text mining for analysing and suggesting approaches to expand the disease/phenotype coverage of EFO within the scope of target validation. We focused on the target-associated disease terms from EFO and five other major disease resources, but there is no reason why this approach could not be applied to other contexts in efforts to integrate across terminologies and ontologies. In future, we will extend our analysis to discover any trends over the resources, and to understand the disease/phenotype target space derived from literature and how many of the associations that we find within the EFO scope are relevant.

ACKNOWLEDGMENTS

This work is funded by Open Targets.

REFERENCES

[1] A. Fabregat, K. Sidiropoulos, P. Garapati, M. Gillespie, K. Hausmann et al., “The Reactome pathway Knowledgebase,” Nucleic Acids Res., 44(D1):D481-7, 2016.
[2] Europe PMC Consortium, “Europe PMC: a full-text literature database for the life sciences and platform for innovation,” Nucleic Acids Res., 43(Database issue):D1042-8, 2015.
[3] I. Spasic, S. Ananiadou, J. McNaught, A. Kumar, “Text Mining and Ontologies in biomedicine: Making sense of raw text,” Briefings in Bioinformatics, 6(3), pp. 239-251, 2005.
[4] S. Pletscher-Frankild, A. Pallejà, K. Tsafou, J.X. Binder, L.J. Jensen, “DISEASES: Text mining and data integration of disease–gene associations,” Methods, 34, pp. 83-89, 2015.
[5] D. Rebholz-Schuhmann, M. Arregui, S. Gaudan, H. Kirsch, A. Jimeno, “Text processing through Web services: calling Whatizit,” Bioinformatics, vol. 24(2), pp. 296-8, 2008.
[6] Ş. Kafkas, I. Dunham and J. McEntyre, “Literature Evidence in Open Targets – a target validation platform,” Phenotype Day @ISMB 2016, special session of the Bio-Ontologies SIG, 8-12 July 2016, Orlando, Florida, U.S.
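As a supplementary illustration of the heuristics in Sections II.B and II.C, the following minimal Python sketch combines the parenthesised-abbreviation filter with sentence-level co-occurrence and the article-level filtering rules. It is not the Europe PMC pipeline or Whatizit. The function names, the article dictionary layout, the naive sentence splitter, the lower bound of two letters for an abbreviation and the simple word-boundary matching are all assumptions made for illustration.

```python
import re
from collections import Counter
from itertools import product

# Parenthesised, all-uppercase sequences of length < 6, e.g. "(ALS)".
# The lower bound of 2 letters is an assumption; the text only states the upper bound.
ABBREVIATION = re.compile(r"\(([A-Z]{2,5})\)")

# Sections named in the text as examples of excluded sections.
EXCLUDED_SECTIONS = {"Methods", "References"}


def retained_abbreviations(document, long_forms):
    """Keep a candidate only if one of its long forms occurs in the document.

    long_forms (hypothetical structure) maps an abbreviation to its long forms.
    """
    lowered = document.lower()
    return {
        candidate
        for candidate in ABBREVIATION.findall(document)
        if any(form.lower() in lowered for form in long_forms.get(candidate, []))
    }


def mentions(term, sentence):
    """Case-insensitive whole-word match; a stand-in for dictionary annotation."""
    return re.search(rf"\b{re.escape(term)}\b", sentence, re.IGNORECASE) is not None


def cooccurrences(text, targets, diseases):
    """Count target-disease pairs that are mentioned in the same sentence."""
    counts = Counter()
    # Naive splitter (assumption): breaks on ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        found_targets = [t for t in targets if mentions(t, sentence)]
        found_diseases = [d for d in diseases if mentions(d, sentence)]
        counts.update(product(found_targets, found_diseases))
    return counts


def extract_associations(article, targets, diseases):
    """Apply the three filtering rules to one article.

    article (hypothetical layout): dict with 'type', 'title', 'abstract' and
    'sections' (section name -> text).
    """
    # Rule 1: keep only "Research" articles.
    if article["type"] != "Research":
        return set()
    front = cooccurrences(article["title"] + ". " + article["abstract"], targets, diseases)
    body = Counter()
    for name, text in article["sections"].items():
        # Rule 2: ignore co-occurrences in excluded sections.
        if name not in EXCLUDED_SECTIONS:
            body.update(cooccurrences(text, targets, diseases))
    # Rule 3: drop pairs seen only once in the body and never in title/abstract.
    return set(front) | {pair for pair, n in body.items() if n > 1}
```

With `long_forms = {"ALS": ["amyotrophic lateral sclerosis"]}`, the abbreviation in “Amyotrophic lateral sclerosis (ALS) is a motor neuron disease.” is retained, whereas the one in “Training in advanced life support (ALS) improved outcomes.” is discarded because no disease long form occurs in the text.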