<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Conference of the Information Retrieval Communities in Europe, July</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>A Study on the Impact of Class Distribution on Deep Learning: The Case of Histological Images and Cancer Detection - Extended Abstract</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ismat Ara Reshma</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Josiane Mothe</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sylvain Cussat-Blanc</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hervé Luga</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Camille Franchet</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Margot Gaspard</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pierre Brousset</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Radu Tudor Ionescu</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dept. of Pathology, Univ. Cancer Institute of Toulouse-Oncopole</institution>
          ,
          <addr-line>1 avenue Irène Joliot-Curie, Toulouse, F-31059</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>IRIT, UMR5505 CNRS, Univ. de Toulouse</institution>
          ,
          <addr-line>118 Route de Narbonne, Toulouse, F-31062 CEDEX 09</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Univ. of Bucharest</institution>
          ,
          <addr-line>14 Academiei, Bucharest 010014</addr-line>
          ,
          <country country="RO">Romania</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>0</volume>
      <fpage>4</fpage>
      <lpage>07</lpage>
      <abstract>
        <p>Studies on deep learning tuning mostly focus on neural network architectures and algorithm hyperparameters. Another core factor for accurate training is the class distribution of the training dataset. This paper contributes to understanding the optimal class distribution in the case of histological images used in cancer diagnosis. We formulate several hypotheses, which are then tested through experiments comprising hundreds of trials. We consider both segmentation and classification tasks, using U-net and a group equivariant CNN (G-CNN). This paper is an extended abstract of another paper published by the authors<sup>1</sup>.</p>
      </abstract>
      <kwd-group>
        <kwd>Computer-aided diagnosis</kwd>
        <kwd>medical information retrieval</kwd>
        <kwd>image segmentation and classification</kwd>
        <kwd>deep learning</kwd>
        <kwd>class-biased training</kwd>
        <kwd>class distribution analysis</kwd>
        <kwd>histological image</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>retrieval<sup>2</sup>. Balanced distribution has become the default choice in state-of-the-art deep learning
methods<sup>3</sup>, although it is not optimal in all cases. Very few analytical studies address the
performance impact of different class distributions, and they were mainly conducted on toy datasets,
even though real datasets may be very different and more complex. There is no evidence that
the conclusions of these studies carry over to cancer whole-slide images (WSIs).</p>
      <p>We present a data-driven analysis that determines the performance impact of different
class distributions in the training data. We derive several hypotheses about WSIs used
for cancer detection. WSIs comprise regions of interest (ROIs), where pathologists look for
abnormalities, and non-ROI regions. We test the hypotheses on both image segmentation and
classification tasks.</p>
      <p>Data imbalance (class bias) is a common problem in machine learning, and many methods
have been proposed to balance data<sup>4</sup>. According to the No Free Lunch theorem<sup>5</sup>,
no single model works best for every task, so a separate analysis is required for each specific
kind of data.</p>
      <p>Moreover, deep CNNs have shown strong performance in cancer detection in WSIs.
Bejnordi et al. organised CAMELYON, a worldwide challenge on cancer detection in
WSIs<sup>6</sup>. Most of the methods proposed in the CAMELYON16 challenge were based on deep
learning; the variation in the participants' results was induced by hyperparameter settings and
data pre-processing. The winning team<sup>7</sup> trained two 22-layer GoogLeNets (V1): one with
randomly sampled training patches (probably biased towards negative examples) and another
with additional hard negative examples.</p>
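      <p>Hard negative examples are negative (non-tumour) patches that a model scores as likely cancer. One common way to collect them, sketched here under our own assumptions and not as the winning team's exact procedure, is to score the negative training patches with a first-round model and keep the k highest-scoring ones for retraining. The function name <monospace>hard_negatives</monospace>, the patch identifiers and the scores are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import List, Sequence


def hard_negatives(
    patch_ids: Sequence[str],
    cancer_probs: Sequence[float],
    labels: Sequence[int],
    k: int,
) -> List[str]:
    """Return the k negative patches (label 0) with the highest predicted cancer probability."""
    negatives = [
        (prob, pid)
        for pid, prob, label in zip(patch_ids, cancer_probs, labels)
        if label == 0
    ]
    # Most confident false alarms first.
    negatives.sort(key=lambda item: item[0], reverse=True)
    return [pid for _, pid in negatives[:k]]


ids = ["a", "b", "c", "d", "e"]
probs = [0.95, 0.70, 0.20, 0.85, 0.10]  # hypothetical first-round model scores
labels = [1, 0, 0, 0, 0]                # 1 = tumour, 0 = normal tissue
print(hard_negatives(ids, probs, labels, k=2))  # ['d', 'b']
```
]]></preformat>
      <p>Patch "a" is ignored because it is a true tumour patch; among the negatives, "d" and "b" are the ones the first model most strongly mistakes for cancer.</p>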
      <p>In this work, we consider four categories of patches: three ROI categories, namely cancer (C),
non-cancer (¬C) and multi-label mixed (C&amp;¬C), and one non-ROI category, other (O). We formulate
several hypotheses and design experiments with the class distributions needed to test them.
The total number of patches in the training set of each experiment is kept the same to ensure a
fair comparison; only the class distribution differs. We introduce U to denote a unit (fixed
number) of patches. Here, we report only the segmentation results, whereas the original paper
covers both binary classification and segmentation.</p>
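      <p>The fixed-total design can be illustrated by a sampler that draws a given number of units U from each category's pool of patches. This is a minimal sketch under stated assumptions, not the authors' code: the function <monospace>build_training_set</monospace>, the pool layout (lists of patch identifiers keyed by category), the unit size and the example distributions are all hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import random
from typing import Dict, List, Sequence, Tuple


def build_training_set(
    pools: Dict[str, Sequence[str]],
    units: Dict[str, int],
    unit_size: int,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Draw units[c] * unit_size distinct patches of category c from pools[c]."""
    rng = random.Random(seed)  # fixed seed so a trial can be reproduced
    selected: List[Tuple[str, str]] = []
    for label, n_units in units.items():
        n = n_units * unit_size
        pool = list(pools[label])
        if n > len(pool):
            raise ValueError(f"{label}: need {n} patches, only {len(pool)} available")
        # Sampling without replacement keeps every patch unique within the set.
        selected.extend((patch_id, label) for patch_id in rng.sample(pool, n))
    rng.shuffle(selected)  # avoid presenting the categories in blocks
    return selected


# Hypothetical pools of patch identifiers for each category.
pools = {c: [f"{c}_{i}" for i in range(50)] for c in ("C", "¬C", "C&¬C", "O")}
U = 10  # hypothetical unit size
balanced = build_training_set(pools, {"C": 2, "¬C": 2, "O": 2}, unit_size=U)
neg_biased = build_training_set(pools, {"C": 1, "¬C": 3, "O": 2}, unit_size=U)
print(len(balanced), len(neg_biased))  # 60 60
```
]]></preformat>
      <p>Both example sets contain 6U patches; only the split across categories changes, which is what keeps the comparison between distributions fair.</p>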
      <p>At the training step, we generate different class distributions in the training set (see Table 1).
Each generated training set is used to train a fully convolutional neural network (FCNN),
U-net<sup>8</sup>. During inference, the trained model makes predictions on patches extracted from
unseen test WSIs. Since false positives (FPs) remain an open problem in cancer detection in WSIs,
we focus on minimizing FPs and use FP-based evaluation metrics, although false negative
(FN)-based metrics are also considered. Specifically, we test our hypotheses using receiver
operating characteristic (ROC) curves, precision-recall (PR) curves, and precision and false
positive rate (FPR) curves; because of the page limit, only the precision and FPR curves are
presented here.</p>
      <fn-group>
        <fn id="fn2"><label>2</label><p>Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research. 2002 Jun 1;16:321-57.</p></fn>
        <fn id="fn3"><label>3</label><p>Halicek M, Shahedi M, Little JV, Chen AY, Myers LL, Sumer BD, Fei B. Head and neck cancer detection in digitized whole-slide histology using convolutional neural networks. Scientific Reports. 2019 Oct 1;9(1):1-1.</p></fn>
        <fn id="fn4"><label>4</label><p>Prati RC, Batista GE, Silva DF. Class imbalance revisited: a new experimental setup to assess the performance of treatment methods. Knowledge and Information Systems. 2015 Oct;45(1):247-70.</p></fn>
        <fn id="fn5"><label>5</label><p>Wolpert DH. The lack of a priori distinctions between learning algorithms. Neural Computation. 1996 Oct 1;8(7):1341-90.</p></fn>
        <fn id="fn6"><label>6</label><p>Bejnordi BE, Veta M, Van Diest PJ, Van Ginneken B, Karssemeijer N, Litjens G, Van Der Laak JA, Hermsen M, Manson QF, Balkenhol M, Geessink O. Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. JAMA. 2017 Dec 12;318(22):2199-210.</p></fn>
        <fn id="fn7"><label>7</label><p>Wang D, Khosla A, Gargeya R, Irshad H, Beck AH. Deep learning for identifying metastatic breast cancer. arXiv preprint arXiv:1606.05718. 2016 Jun 18.</p></fn>
        <fn id="fn8"><label>8</label><p>Ronneberger O, Fischer P, Brox T. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention. 2015 Oct 5 (pp. 234-241). Springer, Cham.</p></fn>
      </fn-group>
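      <p>The two FP-oriented metrics can be computed by sweeping a decision threshold over predicted cancer probabilities. The sketch assumes binary ground truth (1 = C, 0 = not C) and leaves open whether decisions are made per pixel or per patch; the function <monospace>precision_fpr_curve</monospace>, the toy scores and the convention of reporting precision 1.0 when nothing is predicted positive are our assumptions, not the authors' evaluation code.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import List, Sequence, Tuple


def precision_fpr_curve(
    scores: Sequence[float],
    labels: Sequence[int],
    thresholds: Sequence[float],
) -> List[Tuple[float, float, float]]:
    """Return (threshold, precision, FPR) triples; label 1 marks cancer (C)."""
    curve = []
    for t in thresholds:
        tp = fp = tn = 0
        for score, label in zip(scores, labels):
            predicted_cancer = score >= t
            if predicted_cancer and label == 1:
                tp += 1  # cancer correctly detected
            elif predicted_cancer:
                fp += 1  # non-cancer flagged as cancer
            elif label == 0:
                tn += 1  # non-cancer correctly left unflagged
            # remaining case (missed cancer) is a false negative, not needed here
        precision = tp / (tp + fp) if tp + fp else 1.0  # assumed convention
        fpr = fp / (fp + tn) if fp + tn else 0.0
        curve.append((t, precision, fpr))
    return curve


scores = [0.9, 0.8, 0.4, 0.3, 0.2]  # hypothetical predicted cancer probabilities
labels = [1, 0, 1, 0, 0]
for t, p, f in precision_fpr_curve(scores, labels, [0.5, 0.85]):
    print(f"threshold={t}: precision={p:.2f}, FPR={f:.2f}")
```
]]></preformat>
      <p>Raising the threshold from 0.5 to 0.85 removes the single false positive (score 0.8), so precision rises from 0.50 to 1.00 and FPR falls from 0.33 to 0.00, at the cost of missing the cancer example scored 0.4.</p>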
      <p>To generate the results, we used the Metastatic Lymph Node dataset from the University
Cancer Institute of Toulouse-Oncopole (MLNTO). We extracted 127,898 patches (15,328 of which
belong to C) from the training set and 101,262 patches (17,351 of which belong to C) from the
test set (see our original paper<sup>1</sup> for details). There are no duplicate patches, but both
homogeneous and heterogeneous patches occur.</p>
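      <p>The reported counts imply that C patches make up about 12% of the training set and about 17% of the test set. The short calculation below only performs arithmetic on the numbers given in the text; the dictionary layout and the function name <monospace>cancer_share</monospace> are illustrative.</p>
      <preformat preformat-type="code"><![CDATA[
```python
# Patch counts reported for MLNTO; "cancer" is the number of patches belonging to C.
splits = {
    "train": {"total": 127_898, "cancer": 15_328},
    "test": {"total": 101_262, "cancer": 17_351},
}


def cancer_share(split: dict) -> float:
    """Fraction of the patches in a split that belong to class C."""
    return split["cancer"] / split["total"]


for name, split in splits.items():
    rest = split["total"] - split["cancer"]  # patches not counted as C
    print(f"{name}: C share = {cancer_share(split):.1%}, other patches = {rest:,}")
# train: C share = 12.0%, other patches = 112,570
# test: C share = 17.1%, other patches = 83,911
```
]]></preformat>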
      <p>H1: Balanced distribution is optimal for training a model. To test H1, we designed two
experiments, E1.a and E1.b. In E1.a, each of the three classes (C, ¬C, O) contains the same number
of patches, whereas in E1.b the training examples are highly biased (7 times) towards class O,
similar to the natural distribution, as presented in Table 1. Both the balanced and the natural
distributions use a total of 9U patches.</p>
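      <p>Table 1 is not reproduced in this abstract, so the exact E1 counts are not shown here. Under one reading that we assume, a 7-fold bias towards O within a 9U budget leaves 1U each for C and ¬C and 7U for O, while the balanced case assigns 3U to each class. The helper <monospace>units_from_weights</monospace> is hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import Dict


def units_from_weights(weights: Dict[str, int], total_units: int) -> Dict[str, int]:
    """Split a fixed budget of units across classes in proportion to integer weights."""
    weight_sum = sum(weights.values())
    if total_units % weight_sum:
        raise ValueError("the budget must split evenly across the weights")
    per_weight = total_units // weight_sum
    return {label: w * per_weight for label, w in weights.items()}


# E1 under our assumed reading of Table 1; values are numbers of units (U).
e1a = units_from_weights({"C": 1, "¬C": 1, "O": 1}, total_units=9)  # balanced
e1b = units_from_weights({"C": 1, "¬C": 1, "O": 7}, total_units=9)  # O-biased, 7x
print(e1a)  # {'C': 3, '¬C': 3, 'O': 3}
print(e1b)  # {'C': 1, '¬C': 1, 'O': 7}
```
]]></preformat>
      <p>Multiplying each entry by the unit size U gives patch counts; both settings spend the same 9U budget.</p>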
      <p>According to the results (see Figure 1), the natural distribution (blue curve) performs better than
the balanced one (green curve). The same result holds for the ROC and PR curves, so H1 does not hold.</p>
      <p>H2: Over-representing the ¬C class in the training set reduces false positives during
cancer detection. In experiment E2.a (see E2 settings in Table 1), the C and ¬C classes are
balanced, while E2.b over-represents ¬C and E2.c over-represents C.</p>
      <p>We found that the ¬C-biased distribution (blue curve) outperforms the other two distributions:
the balanced (green curve) and the C-biased (red curve) ones. H2 therefore holds according to both
the precision and FPR curves (see Figure 2).</p>
      <p>H3: Multi-label examples are more useful than single-label examples as training data.
We designed three experiments (E3 settings in Table 1) whose training sets include multi-label
(C&amp;¬C) patches. First, E3.a considers a balanced case between C and ¬C. Then, similarly to E2,
E3.b and E3.c over-represent ¬C and C, respectively.</p>
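      <p>H3 depends on separating single-label patches (C or ¬C) from multi-label ones (C&amp;¬C). Assuming each patch comes with a pixel-level annotation mask, the sketch assigns one of the four categories. The integer codes, the function <monospace>patch_category</monospace> and the rule that ROI labels take precedence over non-ROI pixels in the same patch are our assumptions, not details given in the abstract.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import Sequence

# Hypothetical integer codes in an annotation mask (not specified in the abstract).
OTHER, NON_CANCER, CANCER = 0, 1, 2


def patch_category(mask: Sequence[Sequence[int]]) -> str:
    """Assign a patch to C, ¬C, C&¬C (multi-label) or O from its pixel mask."""
    present = {value for row in mask for value in row}
    has_cancer = CANCER in present
    has_non_cancer = NON_CANCER in present
    if has_cancer and has_non_cancer:
        return "C&¬C"  # multi-label (heterogeneous) ROI patch
    if has_cancer:
        return "C"
    if has_non_cancer:
        return "¬C"
    return "O"  # only OTHER pixels: non-ROI patch


print(patch_category([[2, 2], [1, 0]]))  # C&¬C
print(patch_category([[0, 0], [2, 0]]))  # C
print(patch_category([[0, 0], [0, 0]]))  # O
```
]]></preformat>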
      <p>Experiments with multi-label examples (E3) outperform those with single-label examples (E2),
except for the C-biased case (E3.c) (see E2 and E3 in Figure 2). This exception arises because the
C bias is stronger in E3.c than in E2.c (see Table 1). H3 thus holds according to the precision and
FPR curves. When comparing the ¬C-biased case in this setting (E3.b) with the balanced (E3.a) and
C-biased (E3.c) cases, the ¬C-biased case produces fewer false positives, so H2 also holds in this
setting (see Figure 2, E3).</p>
      <p>H4: Non-ROI data are useful for training. We designed three experiments, denoted E4.*
in Table 1. They serve two purposes: to test H4 by comparing E4 with E3, and to re-test H2
in the E4 setting.</p>
      <p>Comparing the precision and FPR curves of the experiments with non-ROI data (E4) with
those without non-ROI data (E3) shows that H4 holds (see E3 and E4 in Figure 2). When comparing
the ¬C-biased case in this setting (E4.b) with the balanced (E4.a) and C-biased (E4.c) cases, the
¬C-biased case produces fewer false positives (see Figure 2, E4); H2 thus holds here as well.</p>
      <p>To conclude, in this research, published in detail in another of our papers<sup>1</sup>, we
performed a data-level analysis to determine the optimal class distribution in the training set
for WSIs when using deep learning. In their natural distribution, WSI data are highly biased
towards non-ROI regions. Common practice is to artificially balance the classes, although there
is no evidence that this is optimal. To the best of our knowledge, ours is the first class
distribution analysis of WSI data for deep learning models; previous research has focused on
developing end-to-end pipelines for cancer detection. We show that easy-to-annotate non-ROI
patches help model training. This result should be useful to researchers building training
datasets of WSIs, and to other applications in which annotation is costly. More generally, such
analyses could help in other real-world problems where data have a complex history, by showing
the importance of building a training set with a proper class distribution.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>