<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Constructing prototypes for classification using epigenetic and genetic analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Christopher L. Bartlett</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Intelligent Bio Systems Laboratory, Biomedical and Health Informatics, State University of New York at Oswego</institution>
          ,
          <addr-line>7060 NY-104, Oswego, NY 13126</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Researchers seek to identify biological markers which accurately differentiate cancer subtypes and their severity from normal controls. One such biomarker, DNA methylation, has recently become more prevalent in genetic research studies in oncology. This project seeks to apply the innovative and adaptive machine learning methodology of case-based reasoning (CBR) to examine DNA methylation levels in breast cancer. Instead of relying on a generalized knowledge base, CBR uses highly specific information extracted from similar cases, which can also greatly expedite the process of finding a solution. Further, this can locate targeted biomarkers by reusing homogeneous factors, or by revising to locate novel biomarkers in highly heterogeneous samples. While locating these biomarkers, this project proposes to use CBR to classify samples, predict prognoses and determine survival factors.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The term epigenetics was first introduced into modern biology by Conrad
Waddington as a means of defining interactions between genes and their products that
result in phenotypic variations. Waddington's landscape presents a cell
becoming more differentiated as time goes on. One of the events that can cause this
differentiation is methylation, the covalent attachment of a methyl
group to cytosine. Cytosine (C) is one of the four bases that construct DNA
and one of only two bases that can be methylated. While adenine can be
methylated as well, cytosine is typically the only base that is methylated in mammals.
Once this methyl group is added, it forms 5-methylcytosine, where the 5
refers to the position on the 6-atom ring where the methyl group is added. In
most circumstances, the methyl group is added to a cytosine followed
by a guanine (G), a site known as CpG. Although the added methyl group
does not alter the underlying DNA sequence, it still has profound
effects on the expression of genes and on cellular and bodily
functions. Methylation at these CpG sites is known to be a fairly
stable epigenetic biomarker that usually results in silencing the gene. Further, the
amount of methylation can be increased (known as hypermethylation) or
decreased (known as hypomethylation), and improper maintenance of epigenetic
information can lead to a variety of human diseases.</p>
      <p>Copyright © 2019 for this paper by its authors. Use permitted
under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>
        Within the domain of case-based reasoning (CBR), there exist several
applications using microarray data. Anaissi, Goyal, Catchpoole, Braytee, and Kennedy
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], for example, attempted to navigate the complexity of the high-dimensional
and imbalanced datasets often found in microarray analysis by focusing on case
retrieval. Their framework uses a k-nearest neighbor (kNN) classifier with a
weighted feature-based similarity measure to retrieve similar patients from a
case base of acute lymphoblastic leukemia. Gene expression data are employed to
determine this similarity, and the treatment and outcome are used to propose
solutions. Feature selection, dimensionality reduction, and feature weighting are used
to handle the high dimensionality of the data and to remove irrelevant features.
They utilize oversampling to deal with the imbalanced classes. More specifically,
they use the synthetic minority oversampling technique (SMOTE),
which artificially creates minority samples by interpolating between
members of the original minority class. After these pre-processing stages, a new
sample is given to the kNN classifier to retrieve similar cases.
      </p>
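      <p>The interpolation idea behind SMOTE can be illustrated with a minimal sketch. This is not the implementation used by Anaissi et al.; the function name <monospace>smote_like</monospace>, its parameters, and the toy expression values are assumptions made only for illustration.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic samples by interpolating minority samples
    toward one of their k nearest minority neighbours (hypothetical helper)."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base sample (excluding itself)
        others = [m for m in minority if m is not base]
        neighbours = sorted(others, key=lambda m: sq_dist(base, m))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Toy expression profiles for a minority class (made-up values)
minority = [[0.10, 0.80], [0.20, 0.70], [0.15, 0.90]]
new_samples = smote_like(minority, n_new=4)
```
]]></preformat>
      <p>Each synthetic sample lies on the line segment between two real minority samples, so the new points stay inside the region already occupied by the minority class.</p>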
      <p>
        Taking a less orthodox approach, Yao and Li [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] treated the microarray samples in each
class as a separate case base. Then, given a sample, they retrieved several similar cases
from each of the case bases. Testing on leukemia, colon, and cancer data, Yao
and Li obtained results that outperformed several classic algorithms, including
a few that used case-based reasoning.
      </p>
      <p>
        Ramos-González et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] used a two-level feature selection process for gene
expression data in squamous cell carcinoma and adenocarcinoma. Their
methodology begins with a preliminary feature selection that uses a non-parametric
Mann-Whitney test to locate genes whose expression levels differ
significantly between subtypes. This is followed by a feature selection stage with
Gradient Boosted Regression Trees that further refines the feature list into a greatly
reduced subset that still maintains a high classification accuracy. A
distance-based approach is used to retrieve similar cases, while additional diagnostic
information may be requested that assists in correcting the prediction.
      </p>
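      <p>The first, non-parametric filtering stage can be sketched as follows. This is only an illustration under stated assumptions: the p-value uses a normal approximation without tie or continuity correction, the gene names and expression values are invented, and the 0.05 threshold is an assumption rather than a value taken from the cited work. The second stage with gradient boosted trees is not shown.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math

def mann_whitney_u(x, y):
    """Return (U, two-sided p) for samples x and y using midranks and a
    normal approximation (no tie or continuity correction)."""
    pooled = sorted((v, g) for g, vals in enumerate((x, y)) for v in vals)
    # Assign midranks so that tied values share their average rank
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        for t in range(i, j + 1):
            ranks[t] = (i + j) / 2 + 1
        i = j + 1
    n1, n2 = len(x), len(y)
    r1 = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u1 = r1 - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    z = (u - n1 * n2 / 2) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return u, math.erfc(abs(z) / math.sqrt(2))

def select_genes(expr_a, expr_b, alpha=0.05):
    """Keep genes whose expression differs between the two subtypes."""
    return [gene for gene in expr_a
            if mann_whitney_u(expr_a[gene], expr_b[gene])[1] < alpha]

# Hypothetical expression values per gene for two subtypes
subtype_a = {"GENE1": [1, 2, 3, 4, 5], "GENE2": [1, 3, 5, 7]}
subtype_b = {"GENE1": [6, 7, 8, 9, 10], "GENE2": [2, 4, 6, 8]}
selected = select_genes(subtype_a, subtype_b)
```
]]></preformat>
      <p>In this toy example only the gene whose values are fully separated between the two groups passes the filter. With many genes, a multiple-testing correction would normally be applied to the p-values before thresholding.</p>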
      <p>
        More recently, Lamy, Sekar, Guezennec, Bouaud and Seroussi [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] proposed
a CBR method that visualizes results. The CBR system was rather
straightforward, retrieving cases through a distance measure; their contribution
lay in explainability. Qualitative attributes of the cases were shown
using rainbow boxes, in which labeled and colored rectangles extend across columns
that represent the cases, clearly showing what was similar or dissimilar between
cases. Quantitative attributes were provided in scatter plots that center on the
query case and accurately display the similar cases.
      </p>
      <p>Two advantages of CBR are its ability to generalize and its explainability. These
properties should lend themselves to an informative view of the epigenetic state of a cancer sample,
and will hopefully assist in determining the heterogeneity of specific subgroups
of samples.</p>
    </sec>
    <sec id="sec-2">
      <title>Research Plan</title>
      <p>The proposed research project seeks to employ CBR in an investigation of the
epigenetic factors of breast cancer. Feature selection methods will be tested and
evaluated to home in on highly specific areas of the epigenome that have been
impacted. A CBR framework to classify cancer samples, predict cancer prognoses
and calculate survival is planned, with the underlying pathophysiological impacts
of the cancer being investigated along the way. Prototypical representations of
the cancer and the clinical subgroups will also be researched.</p>
      <list list-type="order">
        <title>Research Aims</title>
        <list-item>
          <p>To construct a case-based reasoning framework for classification of epigenetic
data in breast cancer which takes covariate factors into account. Primary
work here will focus on retrieving similar cases based on clinical and
epigenetic similarity and using previously located labels to classify novel cases.
In areas of dissimilarity, prior cases will be adapted to conform to the novel
case. Integrating clinical factors has been shown to increase prediction
ability (van Vliet et al., 2012) and prognostic performance (Zhu et al., 2017).
It is hypothesized that the inclusion of these factors will lead to greater
heterogeneity of found biomarkers as well as greater biological relevance.</p>
        </list-item>
        <list-item>
          <p>To extend the established framework to predicting cancer prognoses. After
the construction of a CBR framework for classification, prediction becomes
a natural and swift process. Here, sample similarities will be retrieved and
used to determine patient outcomes, with modifications occurring where
necessary.</p>
        </list-item>
        <list-item>
          <p>To further extend the established framework for survival analyses. As in
Aims 1 and 2, similar samples will be retrieved, though the goal at this phase
is to locate the epigenetic signatures relevant to prolonged patient survival.</p>
        </list-item>
        <list-item>
          <p>To locate deep pathophysiological pathways that have been impacted by
cancer.</p>
        </list-item>
        <list-item>
          <p>To establish a prototypical representation of cancer and clinical subgroups.</p>
        </list-item>
        <list-item>
          <p>To extend the model to reuse prototypes for classification, prediction
and survival analysis.</p>
        </list-item>
      </list>
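      <p>The retrieve-and-reuse step described in the first aim might look like the following sketch. The similarity function, the equal weighting of clinical and epigenetic terms, the covariate names, and all values are hypothetical choices made for illustration; the project does not specify them, and the adaptation of dissimilar cases is not shown.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter

def similarity(case, query, w_clin=0.5):
    """Blend clinical agreement with methylation closeness (both in [0, 1])."""
    keys = query["clinical"].keys()
    clin = sum(case["clinical"][k] == query["clinical"][k] for k in keys) / len(keys)
    # Beta values lie in [0, 1], so 1 - mean absolute difference is also in [0, 1]
    diffs = [abs(a - b) for a, b in zip(case["beta"], query["beta"])]
    epi = 1 - sum(diffs) / len(diffs)
    return w_clin * clin + (1 - w_clin) * epi

def classify(case_base, query, k=3):
    """Retrieve the k most similar cases and reuse their majority label."""
    ranked = sorted(case_base, key=lambda c: similarity(c, query), reverse=True)
    votes = Counter(c["label"] for c in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical case base: clinical covariates plus three methylation beta values
case_base = [
    {"clinical": {"age_group": "50-59", "menopause": "post"},
     "beta": [0.8, 0.7, 0.2], "label": "tumor"},
    {"clinical": {"age_group": "60-69", "menopause": "post"},
     "beta": [0.9, 0.6, 0.1], "label": "tumor"},
    {"clinical": {"age_group": "50-59", "menopause": "pre"},
     "beta": [0.7, 0.8, 0.3], "label": "tumor"},
    {"clinical": {"age_group": "40-49", "menopause": "pre"},
     "beta": [0.2, 0.1, 0.8], "label": "normal"},
    {"clinical": {"age_group": "40-49", "menopause": "post"},
     "beta": [0.1, 0.2, 0.9], "label": "normal"},
]
query = {"clinical": {"age_group": "50-59", "menopause": "post"},
         "beta": [0.85, 0.65, 0.15]}
predicted = classify(case_base, query)
```
]]></preformat>
      <p>Keeping the clinical and epigenetic terms separate makes it possible to inspect why a case was retrieved, which fits the explainability goal of CBR. The weight between the two terms would need to be tuned or learned.</p>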
    </sec>
    <sec id="sec-3">
      <title>Progress-To-Date</title>
      <p>Work was recently completed using DNA methylation to distinguish breast cancer
samples from normal tissue samples. The first stage was to investigate the most
diverse of these cases, stage 4 cancer versus normal tissue. Classification was
performed using naive Bayes (NB), random forest (RF), and k-nearest
neighbor (kNN) with three values of k at each of three stages: after surrogate variable analysis, after
differentially methylated position analysis, and after differentially methylated
region analysis. Finally, methylation probes at each genomic region within a
particular gene were averaged, and features were selected to find the highest
performing genomic regions. The genes with the highest performing genomic regions
were then mapped to KEGG functional pathways and, for the top four functional
pathways, the associated genes were used to classify a larger set of cancer
samples from a variety of stages against normal tissue. The four pathways were olfactory
transduction, neuroactive ligand-receptor interaction, nicotine addiction, and
GABAergic synapse. Results of this classification process are in Table 1.</p>
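      <p>The probe-averaging step can be sketched as a simple grouping operation. The probe identifiers, gene symbols, region labels, and beta values are placeholders, and the helper <monospace>average_by_region</monospace> is a hypothetical name rather than code from this project.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict
from statistics import mean

def average_by_region(beta, annotation):
    """Average probe beta values within each (gene, region) pair."""
    grouped = defaultdict(list)
    for probe, value in beta.items():
        grouped[annotation[probe]].append(value)
    return {key: mean(values) for key, values in grouped.items()}

# Hypothetical probe annotation: probe ID -> (gene symbol, genomic region)
annotation = {
    "probe_1": ("GENE_A", "TSS200"),
    "probe_2": ("GENE_A", "TSS200"),
    "probe_3": ("GENE_A", "Body"),
    "probe_4": ("GENE_B", "TSS200"),
}
# Made-up beta values for a single sample
sample_beta = {"probe_1": 0.25, "probe_2": 0.75, "probe_3": 0.5, "probe_4": 1.0}
region_means = average_by_region(sample_beta, annotation)
```
]]></preformat>
      <p>Each sample is reduced from one value per probe to one value per gene region, which gives the feature selection step far fewer candidate features to rank.</p>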
      <p>While this methodology produced strong results, all iterations of the dataset
suffered from class imbalance, and whether overfitting occurred cannot
yet be determined. With these issues in mind, it is hoped that generating
a strong prototype against which to compare samples will allow a one-to-one
correspondence that eliminates class imbalance and strengthens classification
results. If the prototype can be visualized, this would add to its usefulness
and allow for downstream views into which biological mechanisms contribute to the
prototype's accuracy. Further, stage 4 samples were selected to represent a
heterogeneous group with regard to the epigenetic state, but the small sample size
removed the possibility of separating by clinical factors while still locating
meaningful information. It is believed that a case-based reasoning approach would
mitigate these issues and produce stronger results.</p>
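      <p>One simple way to picture prototype-based comparison is a nearest-prototype classifier. The sketch below assumes, purely for illustration, that a prototype is the mean feature vector of a class; how prototypes will actually be constructed is an open question of this project, and the sample values are invented.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math

def build_prototypes(samples, labels):
    """Average the feature vectors of each class into one prototype."""
    sums, counts = {}, {}
    for vec, label in zip(samples, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

def nearest_prototype(prototypes, query):
    """Assign the label of the prototype closest to the query."""
    return min(prototypes, key=lambda lab: math.dist(prototypes[lab], query))

# Imbalanced toy data: four tumor samples, one normal sample (made-up values)
samples = [[0.8, 0.2], [0.9, 0.1], [0.7, 0.3], [0.8, 0.2], [0.2, 0.9]]
labels = ["tumor", "tumor", "tumor", "tumor", "normal"]
prototypes = build_prototypes(samples, labels)
label = nearest_prototype(prototypes, [0.3, 0.8])
```
]]></preformat>
      <p>Because every class is represented by exactly one prototype, a query is compared with each class once regardless of how many samples that class contains, which is the sense in which a prototype may reduce the effect of class imbalance.</p>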
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Anaissi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goyal</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Catchpoole</surname>
            ,
            <given-names>D.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Braytee</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kennedy</surname>
            ,
            <given-names>P.J.:</given-names>
          </string-name>
          <article-title>Case-based retrieval framework for gene expression data</article-title>
          .
          <source>Cancer Informatics</source>
          <volume>14</volume>
          (
          <year>2015</year>
          ). https://doi.org/10.4137/cin.s22371
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Lamy</surname>
            ,
            <given-names>J.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sekar</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guezennec</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bouaud</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Séroussi</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Explainable artificial intelligence for breast cancer: A visual case-based reasoning approach</article-title>
          .
          <source>Artificial Intelligence in Medicine</source>
          <volume>94</volume>
          ,
          <fpage>42</fpage>
          -
          <lpage>53</lpage>
          (
          <year>2019</year>
          ). https://doi.org/10.1016/j.artmed.2019.01.001
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Ramos-González</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>López-Sánchez</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Castellanos-Garzón</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paz</surname>
            ,
            <given-names>J.F.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corchado</surname>
            ,
            <given-names>J.M.:</given-names>
          </string-name>
          <article-title>A CBR framework with gradient boosting based feature selection for lung cancer subtype classification</article-title>
          .
          <source>Computers in Biology and Medicine</source>
          <volume>86</volume>
          ,
          <fpage>98</fpage>
          -
          <lpage>106</lpage>
          (
          <year>2017</year>
          ). https://doi.org/10.1016/j.compbiomed.2017.05.010
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Yao</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>ANMM4CBR: a case-based reasoning method for gene expression data classification</article-title>
          .
          <source>Algorithms for Molecular Biology</source>
          <volume>5</volume>
          (
          <issue>1</issue>
          ) (
          <year>2010</year>
          ). https://doi.org/10.1186/1748-7188-5-14
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>