<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>COSMIC, curating the cancer variome.</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Simon A. Forbes</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gurpreet Tang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jon Teague</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrew Futreal</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mike Stratton</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus</institution>
          ,
          <addr-line>Hinxton, Cambridge, CB10 1SA</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Background. COSMIC (http://www.sanger.ac.uk/cosmic) is a system designed to curate the world's literature on somatic mutations in known cancer genes. Initially conceived to capture the mutation spread in point-mutated genes, COSMIC has now grown to encompass gene fusion products of genome rearrangement events which generate completely novel transcripts, together with all the somatic mutation data from candidate gene screens at the Cancer Genome Project, UK (CGP), covering almost 5,000 genes of potential interest in cancer genetics. Results. The latest release of COSMIC (version 37; July 2008) now holds full and up-to-date curation of over 5,900 scientific papers, examining over 268,000 tumours, in which over 59,000 mutations are detailed through 60 point-mutated genes. Fusion gene products have been curated for 16 pairs of genes, described through over 4,200 tumours. In total, 2,246 papers were rejected during manual curation, usually due to significant inconsistencies in the publication. A relational database holds the captured information, which is warehoused for each release. The information is presented on the internet with a series of graphical and tabulated views aiding navigation and interpretation. Conclusions. The current version of COSMIC is close to fulfilling its original intentions, with curation of most point-mutated genes in cancer complete. However, new challenges are emerging with the need to calculate the effect of high numbers of observed sequence changes to identify those driving tumour formation, and the need to meaningfully handle the increasing quantities of data from high-throughput screens and next-generation sequencing technologies.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Background</title>
      <p>The scientific literature contains tens of thousands of
publications describing the involvement of somatic
gene mutations in a wide range of human cancer
phenotypes and this is enhanced by a number of online
resources. However, these online resources tend to be
focused on individual loci (e.g. IARC p53 database,
[1]), not providing genome-wide information on
mutation combinations within tumours, or very broadly
focused, storing minimal information, usually only on
high-frequency mutant alleles, thus losing much context
detail (e.g. OMIM, [2]).
The wide distribution of the data in the literature,
together with the broad spread of available details and
formats, makes it very difficult to investigate for
aggregate statistical interpretation. COSMIC is
designed to overcome this limitation by curating all this
data, in as much detail as possible, into one repository
which can be examined easily on the internet. Literature
curation data are presented in tandem with somatic
mutation screening derived from ongoing CGP studies.
Large datasets are therefore made available for deep
datamining, whilst maintaining sample sizes which can
still achieve good statistical significance.</p>
      <p>
        The CGP maintains a listing of all genes that have
been shown to be causally involved in cancer when
mutated, called the Cancer Gene Census (CGC,
http://www.sanger.ac.uk/genetics/CGP/Census/ [
        <xref ref-type="bibr" rid="ref1">3</xref>
        ]).
Describing over 360 genes, this resource has led the
curation efforts of the COSMIC project. Genes with
small somatic intragenic mutations have been
prioritised, followed by genes involved in novel
oncogenic fusions. All the genes released in COSMIC
are periodically revisited to ensure their mutation data
are kept up to date.
All the data in COSMIC are manually curated, so as to
maximise the precision of the data as it is interpreted
from published sources. Manual curation also provides
significant feedback on the quality of data being curated
as well as the methods used in its interpretation; to
ensure the quality of COSMIC's data, 2246 papers have
been rejected during the curation process due to the
absence of mandatory datapoints or significant
inconsistencies in the publication. In addition to this
manual curation, COSMIC's subproject, the "CGP
Cancer Cell Line Project", evaluates each observed
variant for its likely impact on tumour formation,
releasing only those variants considered to
have a deleterious effect.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Results</title>
      <p>Curation of the literature for 62 genes with oncogenic
point mutations is now nearing completion (with the
notable exception of TP53 which is independently
curated; IARC TP53 database, [1]). Additionally, 15
genes have been assessed for their involvement in
fusion events. Whilst the literature curation domain in
COSMIC aims to completely capture the mutation data
on a small set of known cancer genes resulting in large
sample counts, the CGP domain aims to examine a
small number of tumours through a large set of almost
4,800 candidate genes searching for novel
oncomutations. COSMIC has now captured results on a
total of 268,938 tumours which have been examined
through 4,773 genes in various combinations
representing 1,019,304 individual experiments. A total of 5,902
publications, together with unpublished CGP
contributions, have described 59,187 small intragenic
mutations and 2,266 instances of novel fusion
mutations.</p>
      <p>
        COSMIC's website aims to make searching, navigation
and interpretation as easy as possible by providing
extensive graphical summaries, with the contents of each
image reflected in a table with links to further details
(figure 1). Detailed descriptions of the system's
concepts, contents and usage have recently been
published [
        <xref ref-type="bibr" rid="ref2">4</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>Discussion</title>
      <p>
        The extraction of complex mutation annotation, together
with clinical and phenotypic details about tumour
samples, has proved to be beyond any retrieval method other
than manual curation. Building the COSMIC dataset has taken a
team of manual curators eight years and is still ongoing.
Attempts to automatically text-mine scientific
publications could retrieve simple data components, but
rarely captured the rich contextual information that
makes COSMIC so successful in cancer genetics.
With the advent of systematic candidate gene screening
and next-generation sequencing technologies, the size
and scope of screens are increasing. In one of the largest
analyses to date, 22 tumours were examined through
over 18,000 genes in an attempt to find novel oncogenic
variants [
        <xref ref-type="bibr" rid="ref3">5</xref>
        ]. Because of the size of these studies, finding
and interpreting the supplementary datasheets in which the
results are stored is a complex and time-consuming
procedure, suited neither to text-mining programs nor to
human curators alone. A semi-manual, semi-automated approach is
now being explored, in which sample context might be
retrieved manually while simple lists of genes and
mutations are interpreted automatically. With the recent
advent of whole-genome sequencing technologies, these
contextual problems will become even more
acute and increasingly likely to require such an
approach. The CGP has reported its first successful
examinations of whole genomes using next-generation
sequencers [6]; tumours can now receive annotations
across their entire genome, not just of mutations in
putative cancer genes, but of genome mutations and
rearrangements regardless of their position relative to
known coding or regulatory sequences. With total
mutation numbers in the thousands, the storage and
navigation of these data in COSMIC will need to be
radically altered, offering novel navigation and
graphical overviews. Integrating this into the current
system is an upgrade challenge currently underway,
keeping COSMIC up-to-date as mutation detection
technology improves apace.
      </p>
      <p>COSMIC's presentation of complex mutation data in a
phenotypic context has proved very successful, with the
website consistently registering around 400,000 page
impressions per week in 2008. However, imparting
meaning to each variant has become an immensely
complex proposition, especially with the larger
systematic screens examining anonymous genes of
unknown function. Most of COSMIC makes very little
distinction between mutations known to cause cancer
and passenger mutations, unless this is discussed in the
publication being curated, which is rare. Whilst it is
easy to define, for instance, a frameshift mutation in a
tumour suppressor gene as an oncogenic variant, it can
be very difficult to determine the oncogenic
consequence of a novel missense mutation. CGP's
"Cancer Cell Line Project", displayed on COSMIC's
green pages, is the only part of the system to make this
causal distinction. With highly unstable genomes, cell
lines provide mutation hunters with many variants,
many of which will have no significant oncogenic
consequence. COSMIC's green pages therefore
record and display only those with obvious or previously
characterised oncogenic characteristics; any variants not
obviously oncogenic are not shown. Measuring this
consequence is again a manual process,
largely based upon previously published evidence that the
variant has been observed as somatic before or that it has
clear functional consequences [7].</p>
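      <p>The display rule described above can be sketched in a few lines of Python. This is a minimal illustration only and not COSMIC's implementation: the Variant record, its field names and the example values are hypothetical, and in practice the two evidence flags would be assigned by manual curation rather than computed.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical record for one variant observed in a cell line."""
    gene: str
    change: str
    seen_somatic_before: bool      # prior published evidence of somatic occurrence
    clear_functional_effect: bool  # e.g. a truncating change in a tumour suppressor


def is_displayed(variant: Variant) -> bool:
    """Return True if either kind of evidence supports an oncogenic role."""
    return variant.seen_somatic_before or variant.clear_functional_effect


def select_for_display(variants):
    """Keep only variants with supporting evidence, preserving input order."""
    return [v for v in variants if is_displayed(v)]


# Illustrative, made-up inputs (not taken from COSMIC).
example_variants = [
    Variant("KRAS", "p.G12D", True, True),
    Variant("GENE_X", "p.A100V", False, False),
    Variant("GENE_Y", "p.R50fs", False, True),
]
shown = select_for_display(example_variants)
print([v.gene for v in shown])
```
]]></preformat>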
      <p>
        Many of the well-known mutations in cancer have been
fully characterised; for instance, mutations at codons
12, 13 and 61 in KRAS are known to interfere with the
binding of RAS and its inactivator, thus extending the
molecule's signalling, upregulating cellular growth via
the MAPK/ERK signalling cascade [8]. However,
missense mutations are recorded at over 20 other sites
in this gene and the meaning of these is much less clear,
even when the positions of these residues in the 3D
structure of the molecule are easily identifiable. Again,
software is available (e.g. SIFT [
        <xref ref-type="bibr" rid="ref4">9</xref>
        ], CanPredict [
        <xref ref-type="bibr" rid="ref5">10</xref>
        ]) to
help with this interpretation and a statistical evaluation
of the output of many such prediction programs may
well be the way forward in helping sift for the
significant oncomutations requiring further
examination. Calculating the mutation consequence for
the large numbers of variants from systematic screens is
an immediate challenge in cancer genetics. The
majority of sequence changes identified in such screens
are likely to be passengers rather than drivers, and the
ability to differentiate the two with high-throughput
methodologies is becoming a necessity. COSMIC's
high-quality systematic curation and storage of cancer
mutations provides the ideal framework and test
datasets on which to found an effort to develop these
new standards and algorithms.
      </p>
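      <p>One simple form that such a statistical evaluation might take is a majority vote across predictors. The sketch below assumes that each program's output has already been reduced to a boolean "deleterious" call; the predictor names used as keys, the threshold and the example calls are hypothetical placeholders and are not outputs of SIFT or CanPredict.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def consensus_fraction(calls):
    """Fraction of predictors that call the variant deleterious.

    `calls` maps a predictor name to True (deleterious) or False (tolerated).
    """
    if not calls:
        raise ValueError("at least one prediction is required")
    return sum(1 for deleterious in calls.values() if deleterious) / len(calls)


def flag_for_review(variant_calls, threshold=0.5):
    """Return variants whose deleterious fraction exceeds the threshold.

    `variant_calls` maps a variant label to its per-predictor calls.
    Results are sorted by descending fraction, then by label.
    """
    scored = [(consensus_fraction(c), label) for label, c in variant_calls.items()]
    flagged = [(frac, label) for frac, label in scored if frac > threshold]
    return [label for frac, label in sorted(flagged, key=lambda t: (-t[0], t[1]))]


# Made-up calls from three hypothetical predictors.
example_calls = {
    "variant_1": {"tool_a": True, "tool_b": True, "tool_c": False},
    "variant_2": {"tool_a": False, "tool_b": False, "tool_c": True},
    "variant_3": {"tool_a": True, "tool_b": True, "tool_c": True},
}
print(flag_for_review(example_calls))
```
]]></preformat>
      <p>A plain majority vote treats every predictor as equally reliable; weighting each program by its measured accuracy on curated mutation data would be a natural refinement.</p>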
    </sec>
    <sec id="sec-4">
      <title>References</title>
      <p>1. Petitjean, A., Mathe, E., Kato, S., Ishioka, C.,
Tavtigian, S.V., Hainaut, P., Olivier, M. 2007. Impact of
mutant p53 functional properties on TP53 mutation
patterns and tumor phenotype: lessons from recent
developments in the IARC TP53 database. Hum Mutat.
28:622-29.</p>
      <p>2. Hamosh, A., Scott, A.F., Amberger, J., Bocchini, C.,
Valle, D., McKusick, V.A. 2002. Online Mendelian
Inheritance in Man (OMIM), a knowledgebase of
human genes and genetic disorders. Nucl. Acids Res.
30:52-55.</p>
      <p>6. Campbell PJ, Stephens PJ, Pleasance ED,
O'Meara S, Li H, Santarius T, Stebbings LA, Leroy C,
Edkins S, Hardy C, Teague JW, Menzies A, Goodhead
I, Turner DJ, Clee CM, Quail MA, Cox A, Brown C,
Durbin R, Hurles ME, Edwards PA, Bignell GR,
Stratton MR, Futreal PA. 2008. Nat Genet 40:685-686.</p>
      <p>7. Ikediobi ON, Davies H, Bignell G, Edkins S, Stevens
C, O'Meara S, Santarius T, Avis T, Barthorpe S,
Brackenbury L, Buck G, Butler A, Clements J, Cole J,
Dicks E, Forbes S, Gray K, Halliday K, Harrison R,
Hills K, Hinton J, Hunter C, Jenkinson A, Jones D,
Kosmidou V, Lugg R, Menzies A, Mironenko T, Parker
A, Perry J, Raine K, Richardson D, Shepherd R, Small
A, Smith R, Solomon H, Stephens P, Teague J, Tofts C,
Varian J, Webb T, West S, Widaa S, Yates A, Reinhold
W, Weinstein JN, Stratton MR, Futreal PA, Wooster R.
Mutation analysis of 24 known cancer genes in the
NCI-60 cell line set. Mol Cancer Ther. 2006,
5:2606-12.</p>
      <p>8. Scheffzek K, Ahmadian MR, Kabsch W, Wiesmuller
L, Lautwein A, Schmitz F, Wittinghofer A. The
Ras-RasGAP complex: structural basis for GTPase
activation and its loss in oncogenic RAS mutants. 1997.
Science 277:333-338.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          3.
          <string-name>
            <surname>Futreal</surname>
            ,
            <given-names>P.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Coin</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marshall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Down</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hubbard</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wooster</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rahman</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stratton</surname>
            ,
            <given-names>M.R.</given-names>
          </string-name>
          <year>2004</year>
          .
          <article-title>A census of human cancer genes</article-title>
          .
          <source>Nature Reviews Cancer</source>
          <volume>4</volume>
          :
          <fpage>177</fpage>
          -
          <lpage>183</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          4.
          <string-name>
            <surname>Forbes</surname>
            <given-names>SA</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bhamra</surname>
            <given-names>G</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bamford</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dawson</surname>
            <given-names>E</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kok</surname>
            <given-names>C</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clements</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Menzies</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Teague</surname>
            <given-names>JW</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Futreal</surname>
            <given-names>PA</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stratton</surname>
            <given-names>MR</given-names>
          </string-name>
          .
          <article-title>The Catalogue of Somatic Mutations in Cancer (COSMIC)</article-title>
          .
          <source>Curr Protoc Hum Genet</source>
          .
          <year>2008</year>
          , Chapter 10:Unit 10.
          <fpage>11</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          5.
          <string-name>
            <surname>Wood</surname>
            <given-names>LD</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parsons</surname>
            <given-names>DW</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jones</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sjöblom</surname>
            <given-names>T</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leary</surname>
            <given-names>RJ</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shen</surname>
            <given-names>D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boca</surname>
            <given-names>SM</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barber</surname>
            <given-names>T</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ptak</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Silliman</surname>
            <given-names>N</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szabo</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dezso</surname>
            <given-names>Z</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ustyanksky</surname>
            <given-names>V</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nikolskaya</surname>
            <given-names>T</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nikolsky</surname>
            <given-names>Y</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karchin</surname>
            <given-names>R</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wilson</surname>
            <given-names>PA</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaminker</surname>
            <given-names>JS</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            <given-names>Z</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Croshaw</surname>
            <given-names>R</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Willis</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dawson</surname>
            <given-names>D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shipitsin</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Willson</surname>
            <given-names>JK</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sukumar</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polyak</surname>
            <given-names>K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Park</surname>
            <given-names>BH</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pethiyagoda</surname>
            <given-names>CL</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pant</surname>
            <given-names>PV</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ballinger</surname>
            <given-names>DG</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sparks</surname>
            <given-names>AB</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hartigan</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smith</surname>
            <given-names>DR</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suh</surname>
            <given-names>E</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Papadopoulos</surname>
            <given-names>N</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Buckhaults</surname>
            <given-names>P</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Markowitz</surname>
            <given-names>SD</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parmigiani</surname>
            <given-names>G</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kinzler</surname>
            <given-names>KW</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Velculescu</surname>
            <given-names>VE</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vogelstein</surname>
            <given-names>B.</given-names>
          </string-name>
          <article-title>The genomic landscapes of human breast and colorectal cancers</article-title>
          .
          <source>Science</source>
          .
          <year>2007</year>
          ,
          <volume>318</volume>
          :
          <fpage>1108</fpage>
          -
          <lpage>13</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          9.
          <string-name>
            <surname>Ng</surname>
            <given-names>PC</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Henikoff</surname>
            <given-names>S.</given-names>
          </string-name>
          <year>2003</year>
          .
          <source>Nucleic Acids Res</source>
          <volume>31</volume>
          :
          <fpage>3812</fpage>
          -
          <lpage>4</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          10.
          <string-name>
            <surname>Kaminker</surname>
            <given-names>JS</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            <given-names>Y</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Watanabe</surname>
            <given-names>C</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            <given-names>Z</given-names>
          </string-name>
          .
          <article-title>CanPredict: a computational tool for predicting cancer-associated missense mutations</article-title>
          .
          <year>2007</year>
          .
          <source>Nucleic Acids Res</source>
          .
          <volume>35</volume>
          :
          <fpage>W595</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>