MutDR – A Resource for Protein Mutation-Disease
    Relations Assembled from Biomedical Literature

                         Ravikumar Komandur Elayavilli, Majid Rastegar-Mojarad, Hongfang Liu
                                              Department of Health Sciences Research
                                                           Mayo Clinic
                                                     Rochester, MN, 55901


    Abstract— Text mining approaches can accelerate the process
of assembling knowledge from literature. In this abstract, we
present our effort to assemble a resource of protein mutation-
disease relations mined from the literature.

   Keywords—literature mining; protein mutation-disease

                      I. INTRODUCTION
     A large amount of information about the role of gene
variants and mutations in diseases is available in curated
databases such as OMIM [1], ClinVar [2], and UniProtKB [3].
However, much of this information remains 'locked' in
unstructured form in scientific publications. Since manual
curation involves significant human effort and time, there is
always a lag between the curated databases and the literature:
recent findings published in the literature take significant time
to find their way into the curated knowledge bases. Text mining
approaches can accelerate the process of assembling this
knowledge from the published literature. However, developing a
text-mining system with semantic understanding capability in
the biomedical domain is very challenging. In an earlier work,
we described MutD [4], a literature mining system that extracts
relations between protein point mutations and diseases from
biomedical abstracts. In this abstract, we present a PubMed-scale
resource, accessible through a web interface, that allows users
to retrieve protein point mutation-disease relations extracted
through biomedical literature mining.

                     II. BACKGROUND
     MutD is a literature mining system that uses an ensemble
of state-of-the-art named entity extraction and normalization
tools and a graph-based dependency parse representation to
extract relations between protein point mutations and diseases
mentioned in biomedical abstracts. It also extends the scope of
extraction across multiple sentences through discourse
processing and heuristics. MutD achieved a precision of 71% and
a recall of 58% (F-measure: 64%) when compared against the
annotations of UniProtKB.

                         III. METHODS
   In this work, we describe the extension of MutD to create
MutDR, a PubMed-scale resource of literature-mined protein point
mutation and disease (PMD) relations. Figure 1 outlines the
overall workflow in developing MutDR.

   Using MutD, we performed large-scale mining of PMD relations
on the complete PubMed data set (through May 2016). The
extracted PMD relations were indexed using Elasticsearch, and a
simple web-based search interface was developed to enable users
to retrieve the literature-mined PMD relations. The user
interface has three major functionalities: 1) query by a
gene/protein, a disease, or both; 2) retrieve results ranked by
relevance to the query and then by date; and 3) link the
normalized gene and disease entities to external knowledge
resources, namely UniProtKB and the Comparative Toxicogenomics
Database (CTD) [5], respectively.

Fig. 1 – Overall architecture workflow

                          IV. RESULTS
   MutD extracted 27,213 protein mutation-disease relations from
81,048 PubMed abstracts (out of a total of 21 million
abstracts). Figure 2 shows some of the user interface features
of the MutDR resource. The PMD relations extracted by MutD are
indexed using Elasticsearch [6].
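
   The query behaviour described for the search interface (lookup by
gene/protein, disease, or both, with results ranked by relevance and
then by date) could be expressed as an Elasticsearch query body along
the following lines. This is only a minimal sketch, not the MutDR
implementation: the field names "gene", "disease" and "pub_date", the
function name build_pmd_query, and the example search terms are all
assumptions made for illustration, since the index schema is not
described here.

```python
import json
from typing import Optional


def build_pmd_query(gene: Optional[str] = None,
                    disease: Optional[str] = None,
                    size: int = 10) -> dict:
    """Build an Elasticsearch query-DSL body for a PMD relation lookup.

    The field names "gene", "disease" and "pub_date" are hypothetical;
    the actual index schema is not given in the text.
    """
    clauses = []
    if gene:
        clauses.append({"match": {"gene": gene}})
    if disease:
        clauses.append({"match": {"disease": disease}})
    if not clauses:
        raise ValueError("provide a gene/protein, a disease, or both")

    return {
        # Every supplied criterion must match (gene AND disease when both given).
        "query": {"bool": {"must": clauses}},
        # Rank by relevance score first, then by most recent publication date.
        "sort": [
            {"_score": {"order": "desc"}},
            {"pub_date": {"order": "desc"}},
        ],
        "size": size,
    }


if __name__ == "__main__":
    # Illustrative search terms only.
    body = build_pmd_query(gene="BRCA1", disease="breast cancer")
    print(json.dumps(body, indent=2))
```

   The returned dictionary is a plain request body, so it can be
inspected or serialized without a running cluster and then passed to
whichever Elasticsearch client the deployment uses.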
Fig. 2 – Overall architecture workflow

                          REFERENCES
[1]   J. Amberger, C. A. Bocchini, A. F. Scott, and A. Hamosh,
      “McKusick’s Online Mendelian Inheritance in Man (OMIM),”
      Nucleic Acids Res 37: D793–6, 2009.
[2]   M. J. Landrum, J. M. Lee, G. R. Riley, W. Jang, W. S. Rubinstein,
      D. M. Church, et al., “ClinVar: public archive of relationships
      among sequence variation and human phenotype,” Nucleic Acids
      Res 42: D980–5, 2014.
[3]   The UniProt Consortium, “Activities at the Universal Protein
      Resource (UniProt),” Nucleic Acids Res 42: D191–D198, 2014.
[4]   K. E. Ravikumar, K. B. Wagholikar, D. Li, J. P. Kocher, and
      H. Liu, “Text mining facilitates database curation-extraction of
      mutation-disease associations from Bio-medical literature,”
      BMC Bioinformatics, 16(1), 185, 2015.
[5]   A. P. Davis, C. J. Grondin, K. Lennon-Hopkins,
      C. Saraceni-Richards, D. Sciaky, B. L. King, T. C. Wiegers, and
      C. J. Mattingly, “The Comparative Toxicogenomics Database's 10th
      year anniversary: update 2015,” Nucleic Acids Res 43: D914–D920,
      2015.
[6]   R. Kuc and M. Rogozinski, Elasticsearch Server. Packt Publishing
      Ltd, 2013.