=Paper= {{Paper |id=Vol-1747/BT204_ICBO2016 |storemode=property |title=CancerMine: Knowledge Base Construction for Personalised Cancer Treatment |pdfUrl=https://ceur-ws.org/Vol-1747/BT204_ICBO2016.pdf |volume=Vol-1747 |authors=Jake Lever,Martin Jones,Steven Jm Jones |dblpUrl=https://dblp.org/rec/conf/icbo/LeverJJ16 }} ==CancerMine: Knowledge Base Construction for Personalised Cancer Treatment == https://ceur-ws.org/Vol-1747/BT204_ICBO2016.pdf
                                                         CancerMine
                       Knowledge base construction for personalised cancer treatment

              Jake Lever                                          Martin Jones                                  Steven JM Jones
       Genome Sciences Centre                              Genome Sciences Centre                            Genome Sciences Centre
         BC Cancer Agency                                    BC Cancer Agency                                  BC Cancer Agency
         Vancouver, Canada                                   Vancouver, Canada                                 Vancouver, Canada
          jlever@bcgsc.ca                                     mjones@bcgsc.ca                                   sjones@bcgsc.ca



Abstract: Knowledge of the relevant genomic aberrations that drive a particular cancer type is necessary to accelerate efficient interpretation of genomic data and enable large-scale endeavours in precision medicine. Currently, this field is limited by the lack of focused and scalable literature curation tools that can reliably capture the required information. Here we present a knowledge base of genes that have been described in the literature as drivers, oncogenes or tumour suppressors with respect to a specific type of cancer. We have annotated a large body of literature which reports oncogenic aberrations using a custom-designed annotation tool. We then applied VERSE, an in-house relation extraction tool, to catalogue driver mutations and illustrate the ability to build a useful resource for clinical interpretation of genomic data for personalised treatment approaches.

Keywords: relation extraction, oncogenomics, driver mutations

This research was supported by a Vanier Canada Graduate Scholarship and funding from Compute Canada.

I. INTRODUCTION

Improvements in sequencing technology now allow for investigation of individual cancers in a clinically actionable time frame. These technologies reveal a set of mutations in the genome of an individual patient’s cancer. These mutations may disable molecular pathways, up-regulate them or dramatically change their function in the quest for increased tumour growth and drug resistance. A bioinformatician examining these sets of mutations must identify the important changes and highlight those relevant for clinical decisions.

Distinguishing between driver mutations, which are important in tumour development, and passenger mutations, which are coincidental, remains a huge challenge in cancer research. Large-scale projects, including The Cancer Genome Atlas (TCGA) [1], have shone a light on the mutational landscapes of a variety of cancer types. However, TCGA by necessity focuses on only the most common or accessible types of cancer and only on primary tumours. Metastatic tumours are a hugely important area, causing 90% of cancer-related mortality [2], and are not as well studied. Existing resources (such as IntOGen [3]) listing known or statistically derived driver genes rely on these large-scale projects but miss variants which may be exquisitely characterised in smaller-scale studies or are associated with incidental findings discussed in the literature. Smaller studies on specific cancer types are an important resource for cancer researchers in understanding driver mutations. However, the information from these studies is commonly locked in the text of associated publications and has not been curated into a usable database. Cancer types also play an extremely important contextual role in understanding the function of a particular gene. The NOTCH gene can have oncogenic effects in blood cancers and be tumour suppressive in head & neck cancers [4]. Therefore it is very important to link specific genes with a specific form of cancer.

Previous work has linked gene mutations with diseases based on simple distance metrics [5] and used crowdsourcing to annotate gene mutation relations [6]. Our approach uses syntactic and semantic information to predict relations between cancer types and genes, generating a usable knowledge base from a smaller set of expert-annotated data.

II. METHODS

In order to identify sentences that discussed both a human gene and a cancer type, word lists were generated from popular bioinformatics ontologies. Because existing named entity recognition tools miss some specific cancer types, a custom word list was created from the UMLS Metathesaurus [7]. All terms of the type Neoplasm (T191) and their synonyms were selected. This list was then manually trimmed to remove very general cancer terms so that only cancer types remained. The NCBI Gene list [8] with all alternative names was used to create a list of human genes with their synonyms, and was manually trimmed to remove several gene names that are common words in biomedical literature (e.g. MICE). The cancer type list contained 12,522 terms and the gene list contained 59,860 terms. Both word lists were filtered by a list of common English words. This common-word list was built from the stop words in the NLTK toolkit [9], the 5,000 most frequent words in the Corpus of Contemporary American English [10] and a stop word list associated with the NCBI gene data.

Table 1. Examples of annotated sentences used as training data for (a) driving, (b) oncogenic and (c) tumour suppressive associations with PubMed IDs. Gene names are underlined and cancers are bolded.

(a)  Recent studies reported S100A2 protein is a molecular driver in TGF-β induced cell invasion and migration in hepatic carcinoma. (PMID:25591983)
(b)  In summary, our work suggests a new direction for understanding the oncogenic function of TRAF4 in breast cancer. (PMID:25738361)
(c)  In present report, the tumor suppressive role of DMTF1 was studied and confirmed in bladder cancer. (PMID:25965824)
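
The word-list step described in Section II can be illustrated with a minimal sketch. It is not the authors' code: the toy term lists, the stand-in common-word list and the helper names `filter_terms`, `contains_term` and `mentions_gene_and_cancer` are assumptions made for illustration, and the regular-expression matching stands in for whatever tokenised matching the real pipeline performed.

```python
import re

def filter_terms(terms, common_words):
    """Drop terms that are also common English words (case-insensitive)."""
    common = {w.lower() for w in common_words}
    return sorted({t for t in terms if t.lower() not in common})

def contains_term(sentence, terms):
    """Return True if any term occurs in the sentence as a whole word or phrase."""
    return any(
        re.search(r"\b" + re.escape(term) + r"\b", sentence, re.IGNORECASE)
        for term in terms
    )

def mentions_gene_and_cancer(sentence, gene_terms, cancer_terms):
    """Flag a sentence that mentions at least one gene and one cancer type."""
    return contains_term(sentence, gene_terms) and contains_term(sentence, cancer_terms)

# Toy inputs (assumed for illustration). WAS and REST are gene symbols that
# collide with everyday English words, so the filter should remove them.
raw_gene_terms = ["TRAF4", "DMTF1", "S100A2", "WAS", "REST"]
raw_cancer_terms = ["breast cancer", "bladder cancer", "hepatic carcinoma"]
common_words = ["was", "rest", "the", "of"]  # stand-in for the combined stop-word lists

gene_terms = filter_terms(raw_gene_terms, common_words)
cancer_terms = filter_terms(raw_cancer_terms, common_words)

example = ("In summary, our work suggests a new direction for understanding "
           "the oncogenic function of TRAF4 in breast cancer.")
counter_example = "TRAF4 was measured in the rest of the samples."

print(gene_terms)
print(mentions_gene_and_cancer(example, gene_terms, cancer_terms))
print(mentions_gene_and_cancer(counter_example, gene_terms, cancer_terms))
```

Without the common-word filter, the case-insensitive match would treat the ordinary words "was" and "rest" in the second sentence as gene mentions, which is the kind of false match the filtering is meant to prevent.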
Medical literature was downloaded in XML format from the MEDLINE database of PubMed citations and the PubMed Central Open Access subset. The raw text was extracted from the files and processed using the Stanford CoreNLP tools [11]. Text was split into sentences and tokenized. A sentence that contained a term from the cancer types word list and a term from the human gene names word list was flagged and stored in a MySQL database.

In order to enrich the dataset for sentences likely discussing important cancer aberrations, the sentences were filtered for those containing “driv”, “oncogen” or “tumo(u)r suppress”. In literature from 2015, 13,765 sentences were extracted; examples are shown in Table 1. Equal numbers of sentences for each filter were then prepared for annotation.

The CancerMine annotation system displays each pair of cancer type term and gene name term that appear in the same sentence. The user can then tag the term pair as having a driver, oncogenic, tumour suppressive or no relation. Driver relations require the sentence to specifically discuss a genomic aberration driving cancer development. Oncogenic relations require the text to state that an aberration is involved in oncogenesis, while a tumour suppressive relation requires the text to state that the aberration has a tumour suppressive role. In total, 1,203 sentences were annotated by a single annotator, providing 504 driver, 521 oncogenic and 215 tumour suppressive relations. Note that 352 sentences had no relations and 412 sentences had more than one relation.

Annotated sentences were then transformed into the input format appropriate for use with the Vancouver Event and Relation System for Extraction (VERSE) [12], which was used to predict triggerless events between gene and disease entities. VERSE utilises bag-of-words features based on the entire sentence, dependency paths and individual entities. A logistic regression classifier was used to generate a set of probabilities for each annotation type. Only annotations with a probability above a certain threshold were output.

III. RESULTS

A two-fold cross-validation approach was used during a parameter search on a 6,000-core cluster. A stochastic search strategy was used. The F-score metric with beta=0.1 was used to evaluate the success of each run. This allowed a greater focus on precision to improve the quality of the resulting knowledge base. Approximately 75,000 different runs were executed and the optimal parameters were selected based on an average F-score (beta=0.1) of 0.8845. These parameters provided an average precision of 0.941 and recall of 0.128.

The optimal classifier was then applied to the larger set of unannotated sentences. These sentences were from all accessible literature from 2010 to 2016. Table 2 shows an overview of the data included in the CancerMine knowledge base. The difference in the proportion of relations between the training set and the final knowledge base is due to the selection of equal numbers of filtered sentences for each possible relation type for annotation. Importantly, all annotations are associated with a PubMed or PubMed Central ID to allow easy access to the original text of the article or abstract.

Table 2. Overview of data in CancerMine knowledge base

# of analysed sentences                 60,464
# of gene terms                        155,646
# of cancer terms                       79,290
# of driver annotations                  1,967
# of oncogenic annotations               6,877
# of tumour suppressive annotations      3,075

IV. CONCLUSION

In conclusion, we presented a full pipeline for identifying sentences that discuss a gene and cancer type, annotating a large number of sentences and training a high-quality relation classifier on them. This data is an important resource for improved personalised cancer treatment and can be expanded to address other specific questions relevant to genome interpretation, such as clinical outcome.

REFERENCES

[1] Weinstein, John N., et al. "The cancer genome atlas pan-cancer analysis project." Nature genetics 45.10 (2013): 1113-1120.
[2] Mehlen, Patrick, and Alain Puisieux. "Metastasis: a question of life or death." Nature Reviews Cancer 6.6 (2006): 449-458.
[3] Gonzalez-Perez, Abel, et al. "IntOGen-mutations identifies cancer drivers across tumor types." Nature methods 10.11 (2013): 1081-1082.
[4] Radtke, Freddy, and Kenneth Raj. "The role of Notch in tumorigenesis: oncogene or tumour suppressor?." Nature Reviews Cancer 3.10 (2003): 756-767.
[5] Singhal, Ayush, Michael Simmons, and Zhiyong Lu. "Text mining for precision medicine: automating disease-mutation relationship extraction from biomedical literature." Journal of the American Medical Informatics Association (2016): ocw041.
[6] Burger, John D., Emily Doughty, Ritu Khare, Chih-Hsuan Wei, Rajashree Mishra, John Aberdeen et al. "Hybrid curation of gene–mutation relations combining automated extraction and crowdsourcing." Database 2014 (2014): bau094.
[7] Bodenreider, Olivier. "The unified medical language system (UMLS): integrating biomedical terminology." Nucleic acids research 32.suppl 1 (2004): D267-D270.
[8] Maglott, Donna, Jim Ostell, Kim D. Pruitt, and Tatiana Tatusova. "Entrez Gene: gene-centered information at NCBI." Nucleic acids research 33.suppl 1 (2005): D54-D58.
[9] Bird, Steven. "NLTK: the natural language toolkit." Proceedings of the COLING/ACL on Interactive presentation sessions. Association for Computational Linguistics (2006).
[10] Davies, Mark. "The 385+ million word Corpus of Contemporary American English (1990–2008+): Design, architecture, and linguistic insights." International journal of corpus linguistics 14.2 (2009): 159-190.
[11] Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. "The Stanford CoreNLP Natural Language Processing Toolkit." ACL (System Demonstrations). (2014)
[12] Lever, Jake and Steven JM Jones. "VERSE: Event and relation extraction in the BioNLP 2016 Shared Task." Proceedings of the BioNLP Shared Task 2016 Workshop (2016). in press