=Paper=
{{Paper
|id=Vol-3178/CIRCLE_2022_paper_26
|storemode=property
|title=A Study on the Impact of Class Distribution on Deep Learning—The Case of Histological Images and Cancer Detection - Extended Abstract
|pdfUrl=https://ceur-ws.org/Vol-3178/CIRCLE_2022_paper_26.pdf
|volume=Vol-3178
|authors=Ismat Ara Reshma,Josiane Mothe,Sylvain Cussat-Blanc,Hervé Luga,Camille Franchet,Margot Gaspard,Pierre Brousset,Radu Tudor Ionescu
|dblpUrl=https://dblp.org/rec/conf/circle/ReshmaMCLFGBI22
}}
==A Study on the Impact of Class Distribution on Deep Learning—The Case of Histological Images and Cancer Detection - Extended Abstract==
Ismat Ara Reshma (1,*), Josiane Mothe (1,*), Sylvain Cussat-Blanc (1), Hervé Luga (1), Camille Franchet (2), Margot Gaspard (2), Pierre Brousset (2) and Radu Tudor Ionescu (3)

(1) IRIT, UMR5505 CNRS, Univ. de Toulouse, 118 Route de Narbonne, Toulouse, F-31062 CEDEX 09, France
(2) Dept. of Pathology, Univ. Cancer Institute of Toulouse-Oncopole, 1 avenue Irène Joliot-Curie, Toulouse, F-31059, France
(3) Univ. of Bucharest, 14 Academiei, Bucharest 010014, Romania

(*) Corresponding authors: Ismat-Ara.Reshma@irit.fr (I. A. Reshma), Josiane.Mothe@irit.fr (J. Mothe). Other contacts: Sylvain.Cussat-Blanc@irit.fr (S. Cussat-Blanc); Herve.Luga@irit.fr (H. Luga); Franchet.Camille@iuct-oncopole.fr (C. Franchet); Brousset.p@chu-toulouse.fr (P. Brousset); raducu.ionescu@gmail.com (R. T. Ionescu); https://www.irit.fr/~Josiane.Mothe/ (J. Mothe).

ORCID: 0000-0002-9917-6668 (I. A. Reshma); 0000-0001-9273-2193 (J. Mothe); 0000-0001-8675-197X (H. Luga); 0000-0002-3214-0142 (C. Franchet); 0000-0002-8629-3291 (P. Brousset); 0000-0002-9301-1950 (R. T. Ionescu)

CIRCLE ’22: Conference of the Information Retrieval Communities in Europe, July 04–07, 2022, Samatan, France. © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract: Studies on deep learning tuning mostly focus on neural network architectures and algorithm hyper-parameters. Another core factor for accurate training is the class distribution of the training dataset. This paper contributes to understanding the optimal class distribution in the case of histological images used in cancer diagnosis. We formulate several hypotheses, which are then tested in experiments with hundreds of trials. We consider both segmentation and classification tasks, using the U-net and the group equivariant CNN (G-CNN). This paper is an extended abstract of another paper published by the authors [1].

Keywords: Computer-aided diagnosis, medical information retrieval, image segmentation and classification, deep learning, class-biased training, class distribution analysis, histological image

The huge success of deep learning models, such as convolutional neural networks (CNNs), in visual recognition has encouraged researchers to explore their use in various domains, including cancer detection from histological images. Histological images, or whole slide images (WSIs), are digitized histological slides. Methods for automatic cancer detection mainly focus on end-to-end pipeline systems. The success of such systems depends on several hyper-parameters. We hypothesized that one of the most important hyper-parameters is the class distribution of the training set, as it provides the supervision for all learning-based systems. In machine learning, an imbalanced data distribution has been shown to lead to inferior models compared to a balanced distribution in many domains, including biomedical and information retrieval [2].
Balanced distribution has become the default choice in state-of-the-art deep learning methods [3], although it is not optimal in all cases. There are very few analytical studies on the performance impact of different distributions, and they were mainly conducted on toy datasets, even though real datasets may be very different and more complex. There is no evidence that the conclusions of these studies would hold for cancer WSIs. We present a data-driven analysis that determines the performance impact of different class distributions in the training data. We derived several hypotheses with regard to WSIs used for cancer detection. WSIs comprise regions of interest (ROIs), where pathologists look for abnormalities, and non-ROI regions. We tested the hypotheses on both image segmentation and classification tasks.

Data imbalance (class bias) is a common problem in machine learning, and many methods have been proposed to balance data [4]. A separate analysis is required for each special kind of data, following the No Free Lunch Theorem [5]: no single model works best for every task. Moreover, deep CNNs have shown impressive performance levels with regard to cancer detection in WSIs. Bejnordi et al. organised a world-wide challenge on cancer detection in WSIs, known as CAMELYON [6]. Most of the methods proposed in the CAMELYON16 challenge were based on deep learning; the variation in the participants’ results is induced by hyper-parameter settings and data pre-processing. The winning team [7] trained two 22-layer GoogLeNets (V1), one with randomly sampled training patches (probably biased towards negative examples) and another with additional hard negative examples.

In this work, we consider four categories of patches: three ROI categories, namely cancer (C), non-cancer (¬C), and multi-label mixed (C&¬C), and one non-ROI category, other (O). We make several hypotheses and design several experiments with the relevant class distributions to test them. The total number of patches in the training set of each experiment is kept the same to ensure a fair comparison, but their distribution differs. We introduce U to denote a unit (fixed number) of patches. Here, we report the results for segmentation, while in the initial paper we considered both binary classification and segmentation. At the training step, we generate different class distributions in the training set (see Table 1). Each generated training set is used to train a fully convolutional neural network (FCNN), U-net [8].
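As an illustration of how such ratio-controlled training sets could be assembled, the following Python sketch samples patches per category according to a given O : C : ¬C : C&¬C ratio. This is a minimal sketch, not the authors' actual pipeline: the patch pools, the category names, the fixed seed, and sampling without replacement are assumptions made here for illustration; only the unit size (1 U = 5,000 patches) and the ratios come from Table 1 below.

```python
import random

U = 5000  # one unit of patches (1 U = 5,000 patches, as in Table 1)

def build_training_set(pools, ratio, unit=U, seed=0):
    """Sample a training set whose class distribution follows `ratio`,
    a mapping from category name to its number of units."""
    rng = random.Random(seed)
    training_set = []
    for category, parts in ratio.items():
        n = int(parts * unit)
        if n > 0:
            # Sampling without replacement is an assumption; the paper
            # does not specify the exact sampling procedure.
            training_set.extend(rng.sample(pools[category], n))
    rng.shuffle(training_set)
    return training_set

# Hypothetical pools of patch identifiers; real code would hold file paths
# of pre-extracted patches for O, C, notC (i.e., ¬C), and CnotC (C&¬C).
pools = {c: [f"{c}_{i}" for i in range(10 * U)]
         for c in ("O", "C", "notC", "CnotC")}

# E1.b, the natural-like distribution: O : C : ¬C : C&¬C = 7 : 1 : 1 : 0
e1b = build_training_set(pools, {"O": 7, "C": 1, "notC": 1, "CnotC": 0})
assert len(e1b) == 9 * U  # 9 units in total, matching the E1 settings
```

Keeping the total patch count fixed while varying the bias, as in the E2 to E4 settings, only requires passing a different ratio mapping.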
Table 1: Experiment settings. The E1 setting is designed to test H1. E2, E3, and E4 test H2 in different settings, with single-label, multi-label, and non-ROI patches, respectively. The comparison between E2 and E3 tests H3; the comparison between E3 and E4 tests H4. There is a total of 9 units (U) of training patches in E1 and 4U in the other settings. Here, 1 U = 5,000 patches.

Experiment ID | Distribution                 | Patch ratio (O : C : ¬C : C&¬C)
E1.a          | Balanced                     | 3 : 3 : 3 : 0
E1.b          | Over-represented O (natural) | 7 : 1 : 1 : 0
E2.a          | Balanced                     | 0 : 2 : 2 : 0
E2.b          | Over-represented ¬C          | 0 : 1 : 3 : 0
E2.c          | Over-represented C           | 0 : 3 : 1 : 0
E3.a          | Balanced                     | 0 : 1.5 : 1.5 : 1
E3.b          | Over-represented ¬C          | 0 : 0 : 3 : 1
E3.c          | Over-represented C           | 0 : 3 : 0 : 1
E4.a          | Balanced                     | 1 : 1 : 1 : 1
E4.b          | Over-represented ¬C          | 1 : 0 : 2 : 1
E4.c          | Over-represented C           | 1 : 2 : 0 : 1

During inference, the trained model is employed to predict the patches extracted from unseen test WSIs. Since false positives (FPs) are still an ongoing problem in cancer detection in WSIs, we focus on minimizing FPs and use FP-based evaluation metrics, although false negative (FN)-based metrics are also considered. Specifically, we test our hypotheses using the receiver operating characteristic (ROC) curve, the precision-recall (PR) curve, and precision and false positive rate (FPR) curves; because of the page limit, we present only the latter curves here.

To generate the results, we used the Metastatic Lymph Node dataset from the University Cancer Institute of Toulouse-Oncopole, abbreviated as MLNTO. We extracted 127,898 patches (15,328 belonging to C) from the training set and 101,262 patches (17,351 belonging to C) from the test set (see our original paper [1] for details). There are no duplicate patches, but homogeneous and heterogeneous patches occur.
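To make the evaluation protocol concrete, the following sketch computes the precision and FPR curves from prediction scores using scikit-learn. It is a hedged illustration, not the authors' evaluation code: the toy label and score arrays are placeholders for the patch-level predictions on the test WSIs.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_curve

# Placeholder ground truth (1 = cancer) and model scores; in the study
# these would come from the trained U-net's predictions on test patches.
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.2, 0.9, 0.55, 0.15, 0.7])

# Precision as a function of the decision threshold.
precision, recall, pr_thresholds = precision_recall_curve(y_true, y_score)

# ROC curve: fpr is the false positive rate the study focuses on.
fpr, tpr, roc_thresholds = roc_curve(y_true, y_score)

for thr, f in zip(roc_thresholds, fpr):
    print(f"threshold={thr:.2f}  FPR={f:.2f}")
```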
H1: Balanced distribution is optimal for training a model. To test H1, we designed two experiments: E1.a and E1.b. In E1.a, we consider the same number of patches in each of the three classes (C, ¬C, O), whereas in E1.b the training examples are highly biased (7 times) towards class O, similar to the natural distribution, as presented in Table 1. A total of 9U of patches is used to create both the natural and the balanced distribution. According to the results (see Figure 1), the natural distribution (blue curve) is better than the balanced one (green curve). The same result holds when considering the ROC and PR curves.

Figure 1: The natural distribution (E1.b, blue) is better than the balanced distribution (E1.a, green), i.e., ¬H1. Mean precision and FPR curves over 10 runs for the balanced (E1.a) and over-represented O (E1.b) distributions; the color shading indicates the standard deviation.

H2: Over-representing the ¬C class in the training set reduces false positives during cancer detection. In experiment E2.a (see the E2 settings in Table 1), we consider the balanced case between C and ¬C, while E2.b over-represents ¬C and E2.c over-represents C. We found that the ¬C-biased distribution (blue curve) is better than the two other distributions: the balanced (green curve) and the C-biased (red curve) ones. H2 is true according to both the precision and FPR curves (see Figure 2).

Figure 2: All hypotheses tested true. Here, E2 (single-label only), E3 (with multi-label), and E4 (with non-ROI) test H2 in different settings; the comparison between E2 and E3 tests H3; the comparison between E3 and E4 tests H4. Same notation as Figure 1.

H3: Multi-label examples are more useful than single-label examples as training data. We designed three experiments (the E3 settings in Table 1). First, in E3.a, we considered a balanced case between C and ¬C. Then, similarly to E2, in E3.b and E3.c we considered the over-represented ¬C and over-represented C cases. The experiments with multi-label examples (E3) perform better than the ones with single-label examples (E2), with one exception in the C-biased case (E3.c) (see E2 and E3 in Figure 2). The exception occurs because the C bias is larger in E3.c than in E2.c (see Table 1). H3 is thus true according to the precision and FPR curves. When comparing the ¬C-biased case in this setting (E3.b) with the balanced (E3.a) and C-biased (E3.c) cases, the ¬C-biased distribution produces fewer false positives; H2 is thus also true in this setting (see Figure 2, E3).

H4: Non-ROI data are useful for training. We designed three experiments, denoted E4.* in Table 1. The first purpose is to test H4 by comparing E4 with E3; the second is to re-test H2 with the E4 settings. When comparing the precision and FPR curves of the experiments with non-ROI data (E4) with the ones without non-ROI data (E3), H4 is true (see E3 and E4 in Figure 2). When comparing the ¬C-biased case in this setting (E4.b) with the balanced (E4.a) and C-biased (E4.c) cases, the ¬C-biased distribution produces fewer false positives (see Figure 2, E4); H2 is thus true here as well.

To conclude, in this research, which was published in detail in another paper of ours [1], we performed a data-level analysis to determine the optimal distribution of the classes in the training set for WSIs when using deep learning. In the natural distribution, the WSI data is highly biased towards the non-ROIs. Common practice is to artificially balance the classes, although there is no evidence that this is appropriate. To the best of our knowledge, our analysis is pioneering in the class distribution analysis of WSI data for deep learning models; previous research has focused on end-to-end pipeline development for cancer detection. We show that non-ROI patches, which are easy to annotate, help the model training. This result will be helpful for researchers who are building a training dataset of WSIs, or for other applications in which annotation is costly. Such analyses could also help in other real-world problems where data have a complex distribution, highlighting the importance of building a training set with a proper class distribution.

References

[1] Reshma IA, Franchet C, Gaspard M, Ionescu RT, Mothe J, Cussat-Blanc S, Luga H, Brousset P. Finding a Suitable Class Distribution for Building Histological Images Datasets Used in Deep Model Training—The Case of Cancer Detection. Journal of Digital Imaging. 2022:1-24.
[2] Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research. 2002;16:321-357.
[3] Halicek M, Shahedi M, Little JV, Chen AY, Myers LL, Sumer BD, Fei B. Head and neck cancer detection in digitized whole-slide histology using convolutional neural networks. Scientific Reports. 2019;9(1).
[4] Prati RC, Batista GE, Silva DF. Class imbalance revisited: a new experimental setup to assess the performance of treatment methods. Knowledge and Information Systems. 2015;45(1):247-270.
[5] Wolpert DH. The lack of a priori distinctions between learning algorithms. Neural Computation. 1996;8(7):1341-1390.
[6] Bejnordi BE, Veta M, Van Diest PJ, Van Ginneken B, Karssemeijer N, Litjens G, Van Der Laak JA, Hermsen M, Manson QF, Balkenhol M, Geessink O. Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. JAMA. 2017;318(22):2199-2210.
[7] Wang D, Khosla A, Gargeya R, Irshad H, Beck AH. Deep learning for identifying metastatic breast cancer. arXiv preprint arXiv:1606.05718. 2016.
[8] Ronneberger O, Fischer P, Brox T. U-Net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer; 2015. pp. 234-241.