<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Knowledge base construction for personalised cancer treatment</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jake Lever</string-name>
          <email>jlever@bcgsc.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Steven JM Jones</string-name>
          <email>mjones@bcgsc.ca</email>
          <email>sjones@bcgsc.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Genome Sciences Centre BC Cancer Agency Vancouver</institution>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Knowledge of the relevant genomic aberrations that drive a particular cancer type is necessary to accelerate efficient interpretation of genomic data and enable large-scale endeavours in precision medicine. Currently, this field is limited by the lack of focused and scalable literature curation tools that can reliably capture the required information. Here we present a knowledge base of genes that have been described in the literature as drivers, oncogenes or tumour suppressors with respect to a specific type of cancer. We have annotated a large body of literature that reports oncogenic aberrations using a custom-designed annotation tool. We then applied VERSE, an in-house relation extraction tool, to catalogue driver mutations and illustrate the ability to build a useful resource for clinical interpretation of genomic data for personalised treatment approaches.</p>
      </abstract>
      <kwd-group>
        <kwd>relation extraction</kwd>
        <kwd>oncogenomics</kwd>
        <kwd>driver mutations</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>I. INTRODUCTION</title>
      <p>Improvements in sequencing technology now allow for
investigation of individual cancers in a clinically actionable
time frame. These technologies reveal a set of mutations in the
genome of an individual patient’s cancer. These mutations may
disable molecular pathways, up-regulate them or dramatically
change their function in the quest for increased tumour growth
and drug resistance. A bioinformatician examining these sets of
mutations must identify the important changes and highlight
those relevant for clinical decisions.</p>
      <p>
        Distinguishing between driver mutations, which are important
in tumour development, and passenger mutations, which are
coincidental, remains a major challenge in cancer
research. Large-scale projects, including The Cancer Genome
Atlas (TCGA) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], have shone a light on the mutational
landscapes of a variety of cancer types. However, TCGA by
necessity focuses on only the most common or accessible types
of cancer and only on primary tumours. Metastatic tumours are
a hugely important area, causing 90% of cancer-related
mortality [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and are not as well studied. Existing resources
(such as IntOGen [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]) listing known or statistically derived
driver genes rely on these large-scale projects but miss variants
which may be exquisitely characterised in smaller scale studies
or are associated with incidental findings discussed in the
literature. Smaller studies on specific cancer types are an
important resource for cancer researchers in understanding
driver mutations. However, the information from these studies
is commonly locked in the text of associated publications and
has not been curated into a usable database. Cancer types also
play an extremely important contextual role in understanding
the function of a particular gene. For example, the NOTCH gene can have
oncogenic effects in blood cancers and be tumour suppressive
in head &amp; neck cancers [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. It is therefore important to
link specific genes with a specific form of cancer.
      </p>
      <p>This research was supported by a Vanier Canada Graduate Scholarship
and funding from Compute Canada.</p>
      <p>
        Previous work has linked gene mutations with diseases
based on simple distance metrics [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and used crowdsourcing
to annotate gene mutation relations [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Our approach uses
syntactic and semantic information to predict relations between
cancer types and genes, generating a usable knowledge base
from a smaller set of expert-annotated data.
      </p>
    </sec>
    <sec id="sec-methods">
      <title>II. METHODS</title>
      <p>
        To identify sentences that discuss both a human
gene and a cancer type, word lists were generated from popular
bioinformatics ontologies. Because existing named entity
recognition tools miss some specific cancer types, a custom
word list was created from the UMLS Metathesaurus [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. All
terms of the semantic type Neoplastic Process (T191), together with their synonyms, were
selected. This list was then manually trimmed to remove very
general cancer terms so that only cancer types remained. The
NCBI Gene list [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] with all alternative names was used to
create a list of human genes with their synonyms and was
manually trimmed to remove several gene names that are common
words in biomedical literature (e.g. MICE). The cancer type
list contained 12,522 terms and the gene list contained 59,860
terms. Both word lists were then filtered against a list of common
English words. This filter list was built from the stop words
from the NLTK toolkit [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], the most frequent 5,000 words
based on the Corpus of Contemporary American English [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]
and a stop word list associated with the NCBI gene data.
      </p>
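      <p>The common-word filtering step can be illustrated with a minimal sketch. The function name, the toy gene list and the toy common-word list below are assumptions made for illustration only; the real lists come from the NCBI Gene data, the UMLS Metathesaurus, NLTK and the Corpus of Contemporary American English as described above, and case-insensitive matching is likewise an assumption.</p>
      <preformat><![CDATA[
```python
def filter_word_list(terms, common_words):
    """Drop every term that is also a common English word.

    Matching is case-insensitive (an assumption of this sketch).
    """
    common = {word.lower() for word in common_words}
    return sorted({term for term in terms if term.lower() not in common})


# Toy inputs standing in for the real word lists (illustrative only).
gene_terms = ["TP53", "KRAS", "WAS", "REST", "NOTCH1"]
common_words = ["the", "was", "rest", "and"]

filtered_genes = filter_word_list(gene_terms, common_words)
print(filtered_genes)
```
]]></preformat>
      <p>In this toy example the gene symbols WAS and REST are discarded because they collide with everyday English words, while the remaining symbols are kept.</p>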
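      <table-wrap id="tab-1">
        <label>Table 1</label>
        <table>
          <thead>
            <tr><th>Example sentence</th><th>PMID</th></tr>
          </thead>
          <tbody>
            <tr><td>Recent studies reported S100A2 protein is a molecular driver in TGF-β induced cell invasion and migration in hepatic carcinoma.</td><td>25591983</td></tr>
            <tr><td>In summary, our work suggests a new direction for understanding the oncogenic function of TRAF4 in breast cancer.</td><td>25738361</td></tr>
            <tr><td>In present report, the tumor suppressive role of DMTF1 was studied and confirmed in bladder cancer.</td><td>25965824</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>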
Medical literature was downloaded in XML format from
the MEDLINE database of PubMed citations and the PubMed
Central Open Access subset. The raw text was extracted from
the files and processed using the Stanford CoreNLP tools [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
Text was split into sentences and tokenized. Any sentence
containing both a term from the cancer type word list and a term
from the human gene word list was flagged and stored in
a MySQL database.
      </p>
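      <p>The sentence-flagging step can be sketched as follows. This is a simplified illustration rather than the pipeline itself: it uses regular expressions instead of Stanford CoreNLP tokenization, omits the MySQL storage, and the function names, the tiny word lists and the case-insensitive matching are all assumptions.</p>
      <preformat><![CDATA[
```python
import re


def compile_terms(terms):
    """Build one regex that matches any term as a whole word or phrase."""
    # Longest terms first so multi-word terms win over shorter overlaps.
    ordered = sorted(terms, key=len, reverse=True)
    body = "|".join(re.escape(term) for term in ordered)
    return re.compile(r"(?<!\w)(?:" + body + r")(?!\w)", re.IGNORECASE)


def flag_sentence(sentence, cancer_re, gene_re):
    """Return the matched terms if the sentence mentions both entity types."""
    cancers = [m.group(0) for m in cancer_re.finditer(sentence)]
    genes = [m.group(0) for m in gene_re.finditer(sentence)]
    if cancers and genes:
        return {"sentence": sentence, "cancers": cancers, "genes": genes}
    return None


# Toy word lists (illustrative only).
cancer_re = compile_terms(["breast cancer", "bladder cancer"])
gene_re = compile_terms(["TRAF4", "DMTF1"])

hit = flag_sentence(
    "In summary, our work suggests a new direction for understanding "
    "the oncogenic function of TRAF4 in breast cancer.",
    cancer_re,
    gene_re,
)
miss = flag_sentence("TRAF4 is expressed in many tissues.", cancer_re, gene_re)
print(hit)
print(miss)
```
]]></preformat>
      <p>Only the first sentence is flagged, because the second mentions a gene but no cancer type.</p>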
      <p>To enrich the dataset for sentences likely to discuss
important cancer aberrations, the sentences were filtered for
those containing “driv”, “oncogen” or “tumo(u)r suppress”. From
literature published in 2015, 13,765 sentences were extracted;
examples are shown in Table 1. Equal numbers of sentences
for each filter were then prepared for annotation.</p>
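      <p>The three substring filters can be expressed as regular expressions, as in the sketch below. The filter labels and the case-insensitive matching are assumptions; the test sentences are the examples from Table 1.</p>
      <preformat><![CDATA[
```python
import re

# One pattern per filter; "tumou?r" covers both spellings.
# The dictionary keys are illustrative labels, not names from the pipeline.
FILTERS = {
    "driver": re.compile(r"driv", re.IGNORECASE),
    "oncogenic": re.compile(r"oncogen", re.IGNORECASE),
    "tumour_suppressive": re.compile(r"tumou?r suppress", re.IGNORECASE),
}


def matching_filters(sentence):
    """Return the labels of all filters whose pattern occurs in the sentence."""
    return [label for label, pattern in FILTERS.items() if pattern.search(sentence)]


examples = [
    "Recent studies reported S100A2 protein is a molecular driver in "
    "TGF-β induced cell invasion and migration in hepatic carcinoma.",
    "In summary, our work suggests a new direction for understanding "
    "the oncogenic function of TRAF4 in breast cancer.",
    "In present report, the tumor suppressive role of DMTF1 was studied "
    "and confirmed in bladder cancer.",
]
matches = [matching_filters(sentence) for sentence in examples]
print(matches)
```
]]></preformat>
      <p>A sentence may satisfy more than one filter, which is why the function returns a list of labels rather than a single label.</p>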
      <p>The CancerMine annotation system displays each pair of
cancer type term and gene name term that appear in the same
sentence. The user can then tag the term pair as having a driver,
oncogenic, tumour suppressive or no relation. Driver relations
require the sentence to specifically discuss a genomic
aberration driving cancer development. Oncogenic relations
require the text to state that an aberration is involved in
oncogenesis while a tumour suppressive relation requires the
text to state that the aberration has a tumour suppressive role.
In total, 1,203 sentences were annotated by a single annotator
providing 504 driver, 521 oncogenic and 215 tumour
suppressive relations. Note that 352 sentences had no relations
and 412 sentences had more than one relation.</p>
      <p>
        Annotated sentences were then converted into the input
format required by the Vancouver Event and
Relation System for Extraction (VERSE) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. VERSE was used to
predict triggerless events between gene and disease entities.
It utilises bag-of-words features based on the entire
sentence, dependency paths and individual entities. A logistic
regression classifier generated a probability for each
annotation type, and only annotations with a probability
above a set threshold were output.
      </p>
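      <p>The thresholding of per-relation probabilities can be illustrated with a toy one-vs-rest logistic model over bag-of-words features. This is not the VERSE implementation: the weights, the bias, the threshold of 0.7 and all function names are hypothetical values chosen for the example, and dependency-path and entity features are omitted.</p>
      <preformat><![CDATA[
```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bag_of_words(sentence):
    """Lower-cased set of whitespace-separated tokens."""
    return set(sentence.lower().split())


# Hypothetical hand-set weights; a real model would learn these from
# the annotated sentences.
WEIGHTS = {
    "driver": {"driver": 2.5, "drives": 2.5},
    "oncogenic": {"oncogenic": 2.5, "oncogene": 2.5},
    "tumour_suppressive": {"suppressive": 2.5, "suppressor": 2.5},
}
BIAS = -1.5


def predict_probabilities(sentence):
    """Independent logistic probability for each relation type."""
    words = bag_of_words(sentence)
    return {
        label: sigmoid(BIAS + sum(w for token, w in weights.items() if token in words))
        for label, weights in WEIGHTS.items()
    }


def predict_relations(sentence, threshold=0.7):
    """Keep only the relation types whose probability exceeds the threshold."""
    probabilities = predict_probabilities(sentence)
    return {label: p for label, p in probabilities.items() if p > threshold}


sentence = "the oncogenic function of TRAF4 in breast cancer"
kept = predict_relations(sentence, threshold=0.7)
strict = predict_relations(sentence, threshold=0.8)
print(kept, strict)
```
]]></preformat>
      <p>Raising the threshold trades recall for precision: in this toy case the single prediction that survives at 0.7 is suppressed at 0.8.</p>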
    </sec>
    <sec id="sec-2">
      <title>III. RESULTS</title>
      <p>Two-fold cross-validation was used during a parameter
search on a 6,000-core cluster, with a stochastic search
strategy. Each run was evaluated using the F-score with beta=0.1,
which weights precision far more heavily than recall and so
favours the quality of the resulting knowledge base over its
coverage. Approximately 75,000 runs were executed and the
optimal parameters were selected based on an average F-score
(beta=0.1) of 0.8845. These parameters provided an average
precision of 0.941 and recall of 0.128.</p>
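      <p>The effect of beta=0.1 can be checked with the standard F-beta formula. The sketch below plugs in the average precision and recall reported above; because those values are averages over folds, the result need not match the reported average F-score of 0.8845 exactly. The function name is an assumption.</p>
      <preformat><![CDATA[
```python
def f_beta(precision, recall, beta):
    """Standard F-beta score; beta < 1 weights precision more than recall."""
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Average precision and recall as reported in the text.
score = f_beta(0.941, 0.128, beta=0.1)
balanced = f_beta(0.941, 0.128, beta=1.0)
print(round(score, 4), round(balanced, 4))
```
]]></preformat>
      <p>With beta=0.1 the score stays close to the precision despite the low recall, whereas the balanced F1 score for the same precision and recall is far lower.</p>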
      <p>The optimal classifier was then applied to the larger set of
unannotated sentences. These sentences were drawn from all
accessible literature published between 2010 and 2016. Table 2 shows an
overview of the data included in the CancerMine knowledge
base. The difference in proportion of relations in the training
set and the final knowledge base is due to the selection of equal
numbers of filtered sentences for each possible relation type for
annotation. Importantly, all annotations are associated with a
PubMed or PubMed Central ID to allow easy access to the
original text of the article or abstract.</p>
      <p>[Table 2: overview of the CancerMine knowledge base. Surviving row label: "# of tumour suppressive annotations". Surviving values, in extraction order: 60,464; 155,646; 79,290; 1,967; 6,877; 3,075.]</p>
    </sec>
    <sec id="sec-3">
      <title>IV. CONCLUSION</title>
      <p>In conclusion, we have presented a full pipeline for identifying
sentences that discuss a gene and a cancer type, annotating a
large number of these sentences and training a high-precision relation
classifier on them. The resulting knowledge base is an important resource for
improved personalised cancer treatment and can be expanded
to address other specific questions relevant to genome
interpretation, such as clinical outcome.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><surname>Weinstein</surname>, <given-names>John N.</given-names></string-name>, et al.
          <article-title>"The cancer genome atlas pan-cancer analysis project."</article-title>
          <source>Nature genetics</source>
          <volume>45</volume>.<issue>10</issue>
          (<year>2013</year>):
          <fpage>1113</fpage>-<lpage>1120</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><surname>Mehlen</surname>, <given-names>Patrick</given-names></string-name>, and
          <string-name><given-names>Alain</given-names> <surname>Puisieux</surname></string-name>.
          <article-title>"Metastasis: a question of life or death."</article-title>
          <source>Nature Reviews Cancer</source>
          <volume>6</volume>.<issue>6</issue>
          (<year>2006</year>):
          <fpage>449</fpage>-<lpage>458</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><surname>Gonzalez-Perez</surname>, <given-names>Abel</given-names></string-name>, et al.
          <article-title>"IntOGen-mutations identifies cancer drivers across tumor types."</article-title>
          <source>Nature methods</source>
          <volume>10</volume>.<issue>11</issue>
          (<year>2013</year>):
          <fpage>1081</fpage>-<lpage>1082</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><surname>Radtke</surname>, <given-names>Freddy</given-names></string-name>, and
          <string-name><given-names>Kenneth</given-names> <surname>Raj</surname></string-name>.
          <article-title>"The role of Notch in tumorigenesis: oncogene or tumour suppressor?"</article-title>
          <source>Nature Reviews Cancer</source>
          <volume>3</volume>.<issue>10</issue>
          (<year>2003</year>):
          <fpage>756</fpage>-<lpage>767</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><surname>Singhal</surname>, <given-names>Ayush</given-names></string-name>,
          <string-name><given-names>Michael</given-names> <surname>Simmons</surname></string-name>, and
          <string-name><given-names>Zhiyong</given-names> <surname>Lu</surname></string-name>.
          <article-title>"Text mining for precision medicine: automating disease-mutation relationship extraction from biomedical literature."</article-title>
          <source>Journal of the American Medical Informatics Association</source>
          (<year>2016</year>):
          <fpage>ocw041</fpage>.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><surname>Burger</surname>, <given-names>John D.</given-names></string-name>,
          <string-name><given-names>Emily</given-names> <surname>Doughty</surname></string-name>,
          <string-name><given-names>Ritu</given-names> <surname>Khare</surname></string-name>,
          <string-name><given-names>Chih-Hsuan</given-names> <surname>Wei</surname></string-name>,
          <string-name><given-names>Rajashree</given-names> <surname>Mishra</surname></string-name>,
          <string-name><given-names>John</given-names> <surname>Aberdeen</surname></string-name> et al.
          <article-title>"Hybrid curation of gene-mutation relations combining automated extraction and crowdsourcing."</article-title>
          <source>Database</source>
          <volume>2014</volume>
          (<year>2014</year>):
          <fpage>bau094</fpage>.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><surname>Bodenreider</surname>, <given-names>Olivier</given-names></string-name>.
          <article-title>"The unified medical language system (UMLS): integrating biomedical terminology."</article-title>
          <source>Nucleic acids research</source>
          <volume>32</volume>.<issue>suppl 1</issue>
          (<year>2004</year>):
          <fpage>D267</fpage>-<lpage>D270</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><surname>Maglott</surname>, <given-names>Donna</given-names></string-name>,
          <string-name><given-names>Jim</given-names> <surname>Ostell</surname></string-name>,
          <string-name><given-names>Kim D.</given-names> <surname>Pruitt</surname></string-name>, and
          <string-name><given-names>Tatiana</given-names> <surname>Tatusova</surname></string-name>.
          <article-title>"Entrez Gene: gene-centered information at NCBI."</article-title>
          <source>Nucleic acids research</source>
          <volume>33</volume>.<issue>suppl 1</issue>
          (<year>2005</year>):
          <fpage>D54</fpage>-<lpage>D58</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name><surname>Bird</surname>, <given-names>Steven</given-names></string-name>.
          <article-title>"NLTK: the natural language toolkit."</article-title>
          <source>Proceedings of the COLING/ACL on Interactive presentation sessions</source>.
          <publisher-name>Association for Computational Linguistics</publisher-name>
          (<year>2006</year>).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name><surname>Davies</surname>, <given-names>Mark</given-names></string-name>.
          <article-title>"The 385+ million word Corpus of Contemporary American English (1990-2008+): Design, architecture, and linguistic insights."</article-title>
          <source>International journal of corpus linguistics</source>
          <volume>14</volume>.<issue>2</issue>
          (<year>2009</year>):
          <fpage>159</fpage>-<lpage>190</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
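          [11]
          <string-name><surname>Manning</surname>, <given-names>Christopher D.</given-names></string-name>,
          <string-name><given-names>Mihai</given-names> <surname>Surdeanu</surname></string-name>,
          <string-name><given-names>John</given-names> <surname>Bauer</surname></string-name>,
          <string-name><given-names>Jenny Rose</given-names> <surname>Finkel</surname></string-name>,
          <string-name><given-names>Steven</given-names> <surname>Bethard</surname></string-name>, and
          <string-name><given-names>David</given-names> <surname>McClosky</surname></string-name>.
          <article-title>"The Stanford CoreNLP Natural Language Processing Toolkit."</article-title>
          <source>ACL (System Demonstrations)</source>.
          (<year>2014</year>)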
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name><surname>Lever</surname>, <given-names>Jake</given-names></string-name> and
          <string-name><given-names>Steven JM</given-names> <surname>Jones</surname></string-name>.
          <article-title>"VERSE: Event and relation extraction in the BioNLP 2016 Shared Task."</article-title>
          <source>Proceedings of the BioNLP Shared Task 2016 Workshop</source>
          (<year>2016</year>). in press
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>