ICBO 2014 Proceedings

Using Ontology Fingerprints to disambiguate gene name entities in the biomedical literature

Guocai Chen¹, Jieyi Zhao¹, Trevor Cohen¹, Cui Tao¹, Jingchun Sun¹, Hua Xu¹, Elmer V. Bernstam¹, Andrew Lawson², Jia Zeng³, Amber M. Johnson³, Vijaykumar Holla³, Ann M. Bailey³, Funda Meric-Bernstam³, W. Jim Zheng¹,*

¹ Center for Computational Biomedicine, School of Biomedical Informatics, University of Texas Health Science Center at Houston
² Department of Public Health Science, Medical University of South Carolina, 135 Cannon Street, Suite 300, Charleston, South Carolina, 29425
³ Department of Investigational Cancer Therapeutics, Institute for Personalized Cancer Therapy, UT-MD Anderson Cancer Center, 1400 Holcombe Blvd., FC8.3044, Houston, TX 77030

Personalized cancer therapy relies on extensive knowledge of cancer genes, their variants, and treatments that target these variants. While most of this knowledge can be extracted from the biomedical literature, identifying genes and their associated publications with high precision is still a daunting task, often challenged by ambiguous gene names in the text. One way to disambiguate gene names is through gene normalization, the task of mapping a named entity in text to an identifier in a database. However, many genes have multiple names or aliases, and some share identical names even though they are distinct genes with different functions. Developing new methods to distinguish these ambiguous gene names will significantly improve the accuracy of information retrieval and other research-enabling applications.

To overcome this hurdle, we developed an unsupervised approach to create ontology profiles, termed Ontology Fingerprints, from the literature for selected genes that are relevant to personalized cancer therapy. The Ontology Fingerprint for a gene consists of a set of associated GO terms and their ancestors defined by biologists, with an enrichment p-value assigned to each term to reflect the significance of the term. We first used ABGene/GNAT to identify gene names in PubMed abstracts and matched those names to the names or aliases of known genes. The ambiguous names were then assessed by evaluating the degree to which the abstract matched the Ontology Fingerprints of the candidate genes.

We focused only on genes targeted by therapeutics for personalized cancer therapy. Eleven of these genes and their relevant PubMed articles were selected and marked by oncologists and research staff from the Institute for Personalized Cancer Therapy at the UT MD Anderson Cancer Center. For the selected genes, we obtained 93.6% precision for gene name disambiguation and 80.4% AUC for gene and article association. For an additional 223 human genes relevant to cancer, we used the Ontology Fingerprints generated from publications before December 20, 2009 to predict the association of these genes with papers published after 2009, and obtained a precision of up to 92.7%.

We investigated the feasibility of using Ontology Fingerprints to discover associations between genes and PubMed articles, as well as to disambiguate gene name mentions. We obtained reasonable accuracy for gene name disambiguation and for gene and PubMed article association. The Ontology Fingerprint method can improve gene normalization and the analysis of gene and article association. We conclude that Ontology Fingerprints can help disambiguate gene names mentioned in text and analyze the association between genes and articles.

The core algorithm was implemented using a GPU-based MapReduce framework to handle big data and to improve performance. Compared with running the program on the Lonestar cluster, the GPU MapReduce framework achieves speed of the same order of magnitude. Overall, the MapReduce framework makes execution of the program more convenient and affordable, especially on a workstation with an appropriate graphics card.

*Corresponding Author: Wenjin.j.Zheng@uth.tmc.edu
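
The disambiguation step described in the abstract, scoring how well an abstract matches each candidate gene's Ontology Fingerprint, can be illustrated with a minimal sketch. The abstract does not state the scoring function, so the one used here (summing -log10 of the enrichment p-value over fingerprint terms that occur in the abstract) is an assumption made purely for illustration, not the authors' method. The gene labels, GO term placeholders, and p-values are likewise hypothetical.

```python
import math

# Hypothetical toy fingerprints: GO term -> enrichment p-value.
# Gene labels, term IDs (placeholders, not real GO identifiers), and
# p-values are invented for illustration only.
FINGERPRINTS = {
    "GENE_A": {"GO:T1": 1e-8, "GO:T2": 1e-3},
    "GENE_B": {"GO:T3": 1e-6, "GO:T2": 0.05},
}


def match_score(abstract_terms, fingerprint):
    """Assumed scoring: sum -log10(p) over fingerprint terms seen in the abstract."""
    return sum(
        -math.log10(p)
        for term, p in fingerprint.items()
        if term in abstract_terms
    )


def disambiguate(abstract_terms, candidates, fingerprints=FINGERPRINTS):
    """Pick the candidate gene whose fingerprint best matches the abstract.

    Returns (best_gene, scores). best_gene is None when there are no
    candidates or when no candidate shares any term with the abstract.
    """
    scores = {
        gene: match_score(abstract_terms, fingerprints.get(gene, {}))
        for gene in candidates
    }
    if not scores:
        return None, scores
    best = max(scores, key=scores.get)
    return (best if scores[best] > 0 else None), scores


# Usage: an ambiguous name maps to two candidate genes; the abstract's
# ontology terms decide between them.
terms_in_abstract = {"GO:T1", "GO:T2"}
best_gene, all_scores = disambiguate(terms_in_abstract, ["GENE_A", "GENE_B"])
print(best_gene, all_scores)
```

Weighting by -log10(p) lets strongly enriched terms dominate the score; any other monotone weighting of the p-values would fit the same outline.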