CancerMine
Knowledge base construction for personalised cancer treatment

Jake Lever, Genome Sciences Centre, BC Cancer Agency, Vancouver, Canada, jlever@bcgsc.ca
Martin Jones, Genome Sciences Centre, BC Cancer Agency, Vancouver, Canada, mjones@bcgsc.ca
Steven JM Jones, Genome Sciences Centre, BC Cancer Agency, Vancouver, Canada, sjones@bcgsc.ca

Abstract: Knowledge of the relevant genomic aberrations that drive a particular cancer type is necessary to accelerate efficient interpretation of genomic data and enable large-scale endeavours in precision medicine. Currently, this field is limited by the lack of focused and scalable literature curation tools that can reliably capture the required information. Here we present a knowledge base of genes that have been described in the literature as drivers, oncogenes or tumour suppressors with respect to a specific type of cancer. We have annotated a large body of literature which reports oncogenic aberrations using a custom-designed annotation tool. We then applied VERSE, an in-house relation extraction tool, to catalogue driver mutations and illustrate the ability to build a useful resource for clinical interpretation of genomic data for personalised treatment approaches.

Keywords: relation extraction, oncogenomics, driver mutations

I. INTRODUCTION

Improvements in sequencing technology now allow for investigation of individual cancers in a clinically actionable time frame. These technologies reveal a set of mutations in the genome of an individual patient's cancer. These mutations may disable molecular pathways, up-regulate them or dramatically change their function in the quest for increased tumour growth and drug resistance. A bioinformatician examining these sets of mutations must identify the important changes and highlight those relevant for clinical decisions.

Distinguishing between driver mutations, which are important in tumour development, and passenger mutations, which are coincidental, remains a huge challenge in cancer research. Large-scale projects, including The Cancer Genome Atlas (TCGA) [1], have shone a light on the mutational landscapes of a variety of cancer types. However, TCGA by necessity focuses on only the most common or accessible types of cancer and only on primary tumours. Metastatic tumours are a hugely important area, causing 90% of cancer-related mortality [2], and are not as well studied. Existing resources (such as IntOGen [3]) listing known or statistically derived driver genes rely on these large-scale projects but miss variants which may be exquisitely characterised in smaller-scale studies or are associated with incidental findings discussed in the literature. Smaller studies on specific cancer types are an important resource for cancer researchers in understanding driver mutations. However, the information from these studies is commonly locked in the text of associated publications and has not been curated into a usable database. Cancer types also play an extremely important contextual role in understanding the function of a particular gene. The NOTCH gene can have oncogenic effects in blood cancers and be tumour suppressive in head & neck cancers [4]. Therefore, it is very important to link specific genes with a specific form of cancer.

Previous work has linked gene mutations with diseases based on simple distance metrics [5] and used crowdsourcing to annotate gene mutation relations [6]. Our approach uses syntactic and semantic information to predict relations between cancer types and genes to generate a usable knowledge base based on a smaller set of expert-annotated data.

II. METHODS

In order to identify sentences that discussed both a human gene and a cancer type, word lists were generated from popular bioinformatics ontologies. Because existing named entity recognition tools miss some specific cancer types, a custom word list was created from the UMLS Metathesaurus [7]. All terms and their synonyms of the type Neoplasm (T191) were selected. This list was then manually trimmed to remove very general cancer terms so that only cancer types remained. The NCBI Gene list [8] with all alternative names was used to create a list of human genes with their synonyms and was manually trimmed for several gene names that are common words in biomedical literature (e.g. MICE). The cancer type list contained 12,522 terms and the gene list contained 59,860 terms. Both word lists were filtered by a list of common English words. This word list was built from the stop words from the NLTK toolkit [9], the most frequent 5,000 words based on the Corpus of Contemporary American English [10] and a stop word list associated with the NCBI gene data.

Medical literature was downloaded in XML format from the MEDLINE database of PubMed citations and the PubMed Central Open Access subset. The raw text was extracted from the files and processed using the Stanford CoreNLP tools [11]. Text was split into sentences and tokenized. A sentence that contained a term from the cancer types word list and a term from the human gene names word list was flagged and stored in a MySQL database.

In order to enrich the dataset for sentences likely discussing important cancer aberrations, the sentences were filtered for those containing "driv", "oncogen" or "tumo(u)r suppress". In literature from 2015, 13,765 sentences were extracted and examples are shown in Table 1. Equal numbers of sentences for each filter were then prepared for annotation.

Table 1. Examples of annotated sentences used as training data for (a) driving, (b) oncogenic and (c) tumour suppressive associations with PubMed IDs. Gene names are underlined and cancers are bolded.

| Label | Sentence |
|---|---|
| (a) | Recent studies reported S100A2 protein is a molecular driver in TGF-β induced cell invasion and migration in hepatic carcinoma. (PMID:25591983) |
| (b) | In summary, our work suggests a new direction for understanding the oncogenic function of TRAF4 in breast cancer. (PMID:25738361) |
| (c) | In present report, the tumor suppressive role of DMTF1 was studied and confirmed in bladder cancer. (PMID:25965824) |

The CancerMine annotation system displays each pair of cancer type term and gene name term that appear in the same sentence. The user can then tag the term pair as having a driver, oncogenic, tumour suppressive or no relation. Driver relations require the sentence to specifically discuss a genomic aberration driving cancer development. Oncogenic relations require the text to state that an aberration is involved in oncogenesis, while a tumour suppressive relation requires the text to state that the aberration has a tumour suppressive role. In total, 1,203 sentences were annotated by a single annotator, providing 504 driver, 521 oncogenic and 215 tumour suppressive relations. Note that 352 sentences had no relations and 412 sentences had more than one relation.

Annotated sentences were then transformed into the input format appropriate for use with the Vancouver Event and Relation System for Extraction (VERSE) [12]. It was used to predict triggerless events between gene and disease entities. VERSE utilises bag-of-words features based on the entire sentence, dependency paths and individual entities. A logistic regression classifier was used in order to generate a set of probabilities for each annotation type. Only annotations with a probability above a certain threshold were output.

III. RESULTS

A two-fold cross-validation approach was used during a parameter search on a 6000-core cluster. A stochastic search strategy was used. The F-score metric with beta=0.1 was used to evaluate the success of each run. This allowed a greater focus on precision to improve the quality of the resulting knowledge base. Approximately 75,000 different runs were executed and the optimal parameters were selected based on an average F-score (beta=0.1) of 0.8845. These parameters provided an average precision of 0.941 and recall of 0.128.

The optimal classifier was then applied to the larger set of unannotated sentences. These sentences were from all accessible literature from 2010 to 2016. Table 2 shows an overview of the data included in the CancerMine knowledge base. The difference in proportion of relations in the training set and the final knowledge base is due to the selection of equal numbers of filtered sentences for each possible relation type for annotation. Importantly, all annotations are associated with a PubMed or PubMed Central ID to allow easy access to the original text of the article or abstract.

Table 2. Overview of data in CancerMine knowledge base

| Quantity | Count |
|---|---|
| # of analysed sentences | 60,464 |
| # of gene terms | 155,646 |
| # of cancer terms | 79,290 |
| # of driver annotations | 1,967 |
| # of oncogenic annotations | 6,877 |
| # of tumour suppressive annotations | 3,075 |

IV. CONCLUSION

In conclusion, we presented a full pipeline for identifying sentences that discuss a gene and cancer type, annotating a large number of sentences and training a high-quality relation classifier on them. This data is an important resource for improved personalised cancer treatment and can be expanded to address other specific questions relevant to genome interpretation, such as clinical outcome.

This research was supported by a Vanier Canada Graduate Scholarship and funding from Compute Canada.

REFERENCES

[1] Weinstein, John N., et al. "The cancer genome atlas pan-cancer analysis project." Nature genetics 45.10 (2013): 1113-1120.
[2] Mehlen, Patrick, and Alain Puisieux. "Metastasis: a question of life or death." Nature Reviews Cancer 6.6 (2006): 449-458.
[3] Gonzalez-Perez, Abel, et al. "IntOGen-mutations identifies cancer drivers across tumor types." Nature methods 10.11 (2013): 1081-1082.
[4] Radtke, Freddy, and Kenneth Raj. "The role of Notch in tumorigenesis: oncogene or tumour suppressor?" Nature Reviews Cancer 3.10 (2003): 756-767.
[5] Singhal, Ayush, Michael Simmons, and Zhiyong Lu. "Text mining for precision medicine: automating disease-mutation relationship extraction from biomedical literature." Journal of the American Medical Informatics Association (2016): ocw041.
[6] Burger, John D., Emily Doughty, Ritu Khare, Chih-Hsuan Wei, Rajashree Mishra, John Aberdeen et al. "Hybrid curation of gene–mutation relations combining automated extraction and crowdsourcing." Database 2014 (2014): bau094.
[7] Bodenreider, Olivier. "The unified medical language system (UMLS): integrating biomedical terminology." Nucleic acids research 32.suppl 1 (2004): D267-D270.
[8] Maglott, Donna, Jim Ostell, Kim D. Pruitt, and Tatiana Tatusova. "Entrez Gene: gene-centered information at NCBI." Nucleic acids research 33.suppl 1 (2005): D54-D58.
[9] Bird, Steven. "NLTK: the natural language toolkit." Proceedings of the COLING/ACL on Interactive presentation sessions. Association for Computational Linguistics (2006).
[10] Davies, Mark. "The 385+ million word Corpus of Contemporary American English (1990–2008+): Design, architecture, and linguistic insights." International journal of corpus linguistics 14.2 (2009): 159-190.
[11] Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. "The Stanford CoreNLP Natural Language Processing Toolkit." ACL (System Demonstrations) (2014).
[12] Lever, Jake and Steven JM Jones. "VERSE: Event and relation extraction in the BioNLP 2016 Shared Task." Proceedings of the BioNLP Shared Task 2016 Workshop (2016). In press.
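
As a supplementary illustration of the evaluation metric used in Section III, the F-score with beta=0.1 can be computed from precision and recall as sketched below. This is not the authors' code: the function name f_beta is hypothetical and the formula is the standard F-beta definition, which we assume is the one intended. With the reported average precision (0.941) and recall (0.128) it gives roughly 0.885, close to the reported average of 0.8845. An exact match is not expected, because an average of per-run F-scores generally differs from the F-score of the averaged precision and recall.

```python
def f_beta(precision: float, recall: float, beta: float = 0.1) -> float:
    """Standard F-beta score (assumed definition; hypothetical helper).

    beta < 1 weights precision more heavily than recall, which is why a
    low-recall, high-precision classifier can still score well at beta=0.1.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing is predicted correctly
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Reported averages from Section III.
precision, recall = 0.941, 0.128

# Precision-weighted score used for the parameter search (beta=0.1).
print(round(f_beta(precision, recall, beta=0.1), 3))

# For contrast, the balanced F1 score (beta=1) on the same numbers.
print(round(f_beta(precision, recall, beta=1.0), 3))
```

The contrast between the two printed values shows the design choice: with beta=0.1 the score is dominated by precision, so the parameter search favours a knowledge base with few false positives at the cost of missing many true relations.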