A Self-Supervised Learning Approach for Detecting BRCA Mutations in Breast Cancer Histopathological Images

Faycal Touazi1,*,†, Djamel Gaceb1,†, Chaima Belkadi1,† and Besma Loubar1,†

1 LIMOSE Laboratory, Computer Science Department, University M'hamed Bougara, Independence Avenue, 35000 Boumerdes, Algeria

Abstract
Breast and ovarian cancers are among the most pressing health issues affecting women globally, with genetic mutations, particularly in the BRCA1 and BRCA2 genes, significantly influencing their development. This paper offers a comprehensive overview of these cancers, emphasizing the genetic, anatomical, and histopathological factors that contribute to their onset and progression. A detailed examination of the anatomy of the female breast and ovaries provides insight into the origins of these malignancies. The critical role of histopathology in identifying specific cancer subtypes and gene mutations is explored, underscoring its vital importance in diagnosis and treatment. Our results demonstrate that the developed deep learning framework, integrating Vector Quantized-Variational Autoencoders (VQ-VAE) and DBSCAN for clustering, achieved an accuracy of 95% in classifying BRCA mutation-positive and negative cases, outperforming traditional diagnostic methods. By investigating the interplay between genetic predisposition and histopathological analysis, this paper aims to enhance the understanding of breast and ovarian cancers and their implications for public health.

Keywords
Breast cancer, BRCA mutation, deep learning, self-supervised learning, BRCA1, BRCA2

1. Introduction

Breast cancer remains one of the most prevalent cancers globally, affecting millions of women each year. Early detection is a critical factor in improving survival rates, as it allows for timely intervention and treatment. Traditional methods for breast cancer detection, such as mammography, have long been the gold standard in screening programs.
In recent years, deep learning has emerged as a powerful tool for enhancing breast cancer detection, particularly in medical imaging tasks such as mammography interpretation. Deep learning algorithms, particularly Convolutional Neural Networks (CNNs), have demonstrated remarkable success in breast cancer detection from mammography images. Studies have shown that deep learning models can achieve accuracy levels comparable to radiologists in identifying tumors [1, 2, 3, 4]. Mutations in the BRCA1 and BRCA2 genes are among the most well-known genetic risk factors for breast cancer. These mutations, which can be inherited, significantly increase a woman's lifetime risk of developing breast cancer. Women who carry a BRCA1 or BRCA2 mutation have an elevated risk of 50 to 85% of developing breast cancer by the age of 70, compared to a 12% risk in the general population [5]. The discovery of a BRCA mutation in a patient is of crucial importance, not only for early diagnosis and cancer management, but also for informed decision making regarding preventive measures and treatment options. Identifying such mutations can lead to personalized surveillance strategies, risk-reducing surgeries, and targeted therapies, thus improving the overall prognosis and quality of life of high-risk patients [6, 7, 8].

IDDM'2024: 7th International Conference on Informatics & Data-Driven Medicine
* Corresponding author.
† These authors contributed equally.
f.touazi@univ-boumerdes.dz (F. Touazi); d.gaceb@univ-boumerdes.dz (D. Gaceb); c.belkadi@univ-boumerdes.dz (C. Belkadi); b.loubar@univ-boumerdes.dz (B. Loubar)
ORCID: 0000-0001-5949-5421 (F. Touazi)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

Deep learning has revolutionized various domains, and its impact on the medical field is particularly profound.
The ability of deep learning algorithms to analyze complex patterns in large datasets has led to significant advances in medical diagnostics, treatment planning, and personalized medicine. In the context of medical imaging, deep learning models have demonstrated exceptional accuracy in tasks such as detecting abnormalities, classifying diseases, and predicting patient outcomes. These models, which often outperform traditional methods, have the potential to assist clinicians in making more informed decisions and improving patient care. Our study leverages deep learning techniques to address the challenges associated with detecting BRCA1 and BRCA2 mutations in histopathological images of breast and ovarian cancer. By training a robust model on a curated dataset, we aim to provide a reliable tool for identifying these genetic mutations. The results presented in this work highlight the effectiveness of our approach and demonstrate the potential of deep learning in enhancing the accuracy of cancer detection and prognosis. Through this research, we contribute to the growing body of evidence supporting the integration of deep learning into clinical practice, ultimately aiming to improve outcomes for patients with hereditary cancer risks. This paper is organized as follows: Section 2 reviews related works. Section 3 outlines our proposed approach, focusing on the Vector Quantized Variational AutoEncoder (VQ-VAE). Section 4 describes the experimental setup, including the TCGA-BRCA dataset, preprocessing, and evaluation metrics. Section 5 presents the results, covering clustering, BRCA patch classification, and SVS image classification, along with comparisons to related work. Finally, Section 6 concludes with a summary of findings and future research directions.

2. Related Works

In this section, we offer a comprehensive review of recent studies that focus on detecting BRCA mutations in breast cancer using deep learning methods. Shen Zhao et al.
[9] developed a deep learning framework for comprehensive molecular and prognostic stratifications of triple-negative breast cancer (TNBC). The framework features two CNNs in series: the first, a tissue type classifier based on ResNet-18, achieved a weighted F1 score of 0.96, classifying tissue types with near 90% accuracy. The second CNN predicted molecular features and relapse risks with AUCs ranging from 0.71 to 0.76. Xiaoxiao Wang et al. [10] proposed a deep learning model based on CNNs to predict BRCA gene mutations from histopathological images. Trained on the JSPHCM and JSCH datasets, their model demonstrated high performance, with an AUC of 79%. Tristan Lazard et al. [11] employed multiple instance learning (MIL) techniques to identify morphological patterns indicative of homologous recombination deficiency in luminal breast cancers. Their model, tested on a dataset of 673 WSIs from TCGA and an in-house dataset, achieved an AUC of 71%. Nam Nhut Phan et al. [12] developed a deep learning pipeline for classifying breast cancer molecular subtypes from unannotated pathological images. Their approach utilized a two-step transfer learning process with CNNs such as ResNet50, ResNet101, VGG16, and Xception. Initially, the models were pre-trained on ImageNet and then fine-tuned on an internal dataset. They were subsequently trained on the TCGA-BRCA dataset to classify breast cancer into basal, HER2, luminal A, and luminal B subtypes. The images were normalized to 512×512 pixels, and patches were extracted from WSIs. The models achieved average AUCs ranging from 88% to 92%. Kurian et al. [13] proposed a semi-supervised learning approach to classify breast cancer subtypes using histopathological images from the TCGA-BRCA dataset. They focused on differentiating between Basal and Luminal A PAM50 subtypes by analyzing a curated subset of 180 whole slide images (WSIs) selected to minimize heterogeneity.
Their model leveraged a Deep Neural Network (DNN) architecture based on SimCLR with a ResNet18 backbone for out-of-distribution (OOD) detection, pre-trained on a large histology image dataset. Patch extraction from annotated tumor regions enabled the model to focus on relevant regions, although it introduced potential label noise. They achieved a patient-level accuracy of 81.43%. The methodology employed by Valieris et al. [14] involved developing a deep learning framework to detect homologous recombination (HR) deficiency in breast tumors using the TCGA-BRCA dataset. The model leveraged whole slide images (WSI), utilizing advanced image processing techniques to extract histopathological features indicative of HR deficiency. To address the complexity and variability in these images, the authors implemented a multiple instance learning (MIL) approach, allowing the model to learn from entire tumor samples without the need for manual segmentation. Their model achieved an area under the curve (AUC) of 80%. Table 1 provides a comparative summary of the performance achieved in state-of-the-art studies for breast cancer classification, highlighting the different datasets, methods, and evaluation metrics used.

Table 1
Performance Achieved in State-of-the-Art Breast Cancer Studies

Reference                  | Dataset      | Year | Methods                              | Metrics
Tristan Lazard et al. [11] | TCGA         | 2022 | ResNet-18                            | AUC 71%
Xiaoxiao Wang et al. [10]  | JSPHCM, JSCH | 2021 | ResNet-18                            | AUC 79%
Kurian et al. [13]         | TCGA-BRCA    | 2023 | SimCLR                               | 81.34% accuracy
Valieris et al. [14]       | TCGA         | 2020 | ResNet-34                            | AUC 80%
Nam Nhut Phan et al. [12]  | TCGA-BRCA    | 2021 | 2-step ResNet50/101, VGG16, Xception | AUC 92%

3. Proposed approach

In this section, we describe our proposed approach for the detection and diagnosis of breast cancer using advanced deep learning techniques. Our approach is designed to address the challenges of analyzing histopathological images and aims to provide a comprehensive solution to detect and classify breast masses.
We first introduce a key architecture used in our proposal.

3.1. Vector Quantized Variational AutoEncoder

The Vector Quantized Variational AutoEncoder (VQ-VAE) [15] is a type of variational autoencoder that introduces vector quantization to obtain a discrete latent representation, distinguishing itself from traditional VAEs, which produce continuous latent codes. The VQ-VAE uses a codebook, a matrix e of dimensions K × D, where K represents the number of embeddings and D is the dimensionality of each embedding (see Figure 1). This architecture enables the encoding of data into discrete codes, which helps in learning compact and structured representations. The model consists of three main components:

• An encoder that maps input data x (such as images) into continuous latent representations z.
• A vector quantizer that transforms these continuous representations into discrete vectors e_k from the codebook, selecting the closest embedding by minimizing the Euclidean distance:

    quantization(z) = argmin_{e_k ∈ E} ‖z − e_k‖₂   (1)

• A decoder that reconstructs the original data from the discrete latent codes.

The codebook, which contains the learned embeddings, plays a critical role in the quantization process. It allows the continuous output of the encoder to be mapped to discrete codes, facilitating the generation of data from these discrete codes. By using a discrete latent space, VQ-VAE simplifies model optimization and enables the use of generative models based on discrete distributions, such as PixelCNN or other autoregressive models, to model the latent codes. One of the key advantages of VQ-VAE is its ability to capture meaningful structural representations in the data, making it particularly useful for high-quality generation tasks, especially in areas like medical imaging and discrete signal modeling.

Figure 1: VQ-VAE Architecture: The image is encoded into a grid of latent vectors.
These vectors are replaced by the nearest codebook vector at the bottleneck. Finally, the quantized vectors pass through the decoder to reconstruct the image [15].

3.2. Methodology

In this section, we detail the steps of our methodology for analyzing histopathological images to identify BRCA mutations and classify cancerous versus normal cells. Our approach employs image preprocessing, feature extraction using a Vector Quantized-Variational Autoencoder (VQ-VAE), and clustering techniques to organize images based on mutation status, enabling precise classification and improving diagnostic accuracy (see Figure 2).

Figure 2: Overview of the proposed architecture for histopathological image analysis.

• Step 1: Input Images. The process begins with the acquisition of large histopathology images in SVS (Aperio ScanScope Virtual Slide) format. These images are particularly challenging to handle due to their high resolution and substantial size, which necessitates advanced processing techniques. To manage this, the SVS images are divided into smaller patches of size 1400 × 920 pixels. This approach simplifies the analysis and processing of the images while retaining important details, and yields a diverse set of patches.
• Step 2: Dataset Categorization. After dividing the dataset into small patches that CNN models can process, further refinement involves categorizing patches based on BRCA mutation status. The patches are divided into three distinct categories: those from SVS images where BRCA1 is identified, those where BRCA2 is identified, and those with no identified BRCA mutations. This detailed categorization enables a more focused analysis of IDC patches in relation to specific BRCA mutations, enhancing the understanding of their histopathological features.
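The patch-extraction step above can be sketched as follows. A real pipeline would decode SVS files with a whole-slide library such as OpenSlide; here a NumPy array stands in for the decoded slide, and the 1400 × 920 patch size matches the text. The toy slide dimensions are illustrative.

```python
import numpy as np

def extract_patches(image, patch_w=1400, patch_h=920):
    """Divide a whole-slide image (an H x W x 3 array) into
    non-overlapping patches of patch_h x patch_w pixels.
    Edge regions smaller than a full patch are discarded."""
    h, w = image.shape[:2]
    patches = []
    for y in range(0, h - patch_h + 1, patch_h):
        for x in range(0, w - patch_w + 1, patch_w):
            patches.append(image[y:y + patch_h, x:x + patch_w])
    return patches

# Toy example: a 2760 x 4200 "slide" yields a 3 x 3 grid of patches.
slide = np.zeros((2760, 4200, 3), dtype=np.uint8)
patches = extract_patches(slide)
print(len(patches))  # 9
```

Discarding incomplete edge tiles keeps every patch at a fixed size, which simplifies batching for the CNN models used later.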
• Step 3: Image Characterization Using a Vector Quantized-Variational Autoencoder (VQ-VAE). In our approach, we utilize a VQ-VAE [16] with a codebook of 1024 discrete vectors to handle and analyze high-resolution histopathological images. This model is crucial for effectively managing the complexity of these images through its encoder-decoder architecture.
– Encoder Network: The encoder network transforms high-resolution input images into a continuous latent representation. It consists of multiple neural network layers that extract significant features and reduce the dimensionality of the images while retaining important details.
– Vector Quantization: Following the generation of the continuous latent representation, the VQ-VAE applies vector quantization. This process maps the continuous latent vectors to the nearest discrete vectors in the predefined codebook of 1024 entries. This quantization step converts the latent space into a more manageable and structured form, which simplifies further analysis.
– Codebook: The codebook, comprising 1024 discrete vectors, is updated during training to minimize reconstruction loss. This ensures that the codebook effectively captures the essential characteristics of the input images.
– Decoder Network: The decoder network reconstructs the high-resolution images from the quantized latent representation. Using the discrete codes produced by the encoder, the decoder aims to accurately recreate the original images, preserving critical features and details.
– Dimensionality Reduction and Efficient Analysis: The combination of the encoder, vector quantization, and decoder facilitates dimensionality reduction of high-resolution images. This reduction compresses the data into a latent space that retains essential information, making the data more suitable for efficient analysis and processing.
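The nearest-codebook lookup at the heart of Step 3, together with the codebook and commitment loss terms, can be sketched in NumPy. The embedding dimension (64) and β = 0.25 below are illustrative assumptions, not values reported in this work; a trained VQ-VAE would also apply stop-gradients so that the codebook and commitment terms update different parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

K, D = 1024, 64                      # codebook size; D is an assumed embedding dim
codebook = rng.normal(size=(K, D))

def quantize(z, codebook):
    """Map each continuous latent vector in z (N x D) to its nearest
    codebook entry under Euclidean distance (the argmin of equation 1)."""
    # Squared distances between every latent vector and every codebook entry.
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d2.argmin(axis=1)
    return idx, codebook[idx]

z = rng.normal(size=(16, D))         # stand-in encoder outputs for 16 patches
idx, z_q = quantize(z, codebook)

# Codebook and commitment loss terms; numerically identical here, but in
# training they differ by stop-gradient placement (codebook loss moves the
# codebook toward the encoder output, commitment loss does the reverse).
codebook_loss = ((z - z_q) ** 2).sum(-1).mean()
commitment_loss = 0.25 * codebook_loss   # beta = 0.25 (assumed)
```

A quick sanity check: quantizing a codebook vector returns its own index, since each entry is its own nearest neighbour.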
• Step 4: Feature Extraction with VQ-VAE. To extract meaningful features from the images, we utilize the VQ-VAE model. The VQ-VAE's encoder network processes the histopathological images to generate continuous latent representations, which are then quantized into discrete vectors using the codebook. This approach captures intricate patterns and features within the images. The aim is to characterize the images with a reduced-dimensionality representation, which simplifies and enhances the clustering operation. This method provides a comprehensive and structured feature representation by reducing the dimensionality of the high-resolution images, making it easier to perform effective clustering.
• Step 5: Latent Space Representation. After feature extraction, each image is encoded into a latent vector. These latent vectors collectively form a dataset that is used for subsequent analysis. This latent space representation simplifies the data and prepares it for clustering and other operations.
• Step 6: Clustering and Annotation. After encoding the images into latent vectors, we apply the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm to group similar vectors into clusters. This clustering approach organizes the images into meaningful groups based on their feature vectors, distinguishing between BRCA mutation-positive and BRCA mutation-negative cases.
• Step 7: Classification. The final stage involves classifying the images into two main categories: normal cells and cancerous cells. This classification is based on the previously obtained clusters and latent space representations. The model aims to enhance the accuracy of differentiating between various types of cancerous and non-cancerous tissues, thereby improving diagnostic capabilities.

By integrating VQ-VAE with advanced CNNs and clustering techniques, our approach provides a robust framework for analyzing histopathological images.
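The clustering of Step 6 can be sketched with scikit-learn's DBSCAN on stand-in latent vectors. The synthetic data, eps, and min_samples below are illustrative only and are not the values used in our experiments.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(42)

# Stand-in latent vectors: two dense groups plus a few scattered outliers.
group_a = rng.normal(loc=0.0, scale=0.3, size=(200, 8))
group_b = rng.normal(loc=5.0, scale=0.3, size=(200, 8))
outliers = rng.uniform(-10, 10, size=(5, 8))
latents = np.vstack([group_a, group_b, outliers])

# eps and min_samples are illustrative; in practice they are tuned
# on the real latent space.
labels = DBSCAN(eps=1.0, min_samples=10).fit_predict(latents)

# DBSCAN labels noise points -1; count the remaining distinct clusters.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters)  # 2
```

DBSCAN needs no preset number of clusters and marks low-density points as noise, which is why it suits a latent space whose group structure is not known in advance.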
This methodology is designed to improve the performance of breast cancer detection and diagnosis, offering a more accurate and comprehensive analysis of histological samples.

4. Experimentations and results

4.1. TCGA-BRCA Dataset

The TCGA-BRCA dataset, referenced in [17], is part of the Cancer Genome Atlas (TCGA) project, which aims to enhance the understanding of cancer through comprehensive genomic studies. This dataset includes RNA sequencing, somatic mutation profiles, and gene-level copy number variation data from 1,098 breast invasive carcinoma cases. It contains 1,978 images from these 1,098 patients, with 763 tumor samples that include single nucleotide polymorphism (SNP) and copy number variation (CNV) data generated using the Affymetrix 6.0 SNP array, alongside somatic mutation information obtained from the Illumina sequencing platform. Data sources include the Genomic Data Commons (GDC) Data Portal, the Pan-Cancer Atlas, and the Broad Institute's TCGA GDAC Firehose. The dataset is publicly available through both the GDC Data Portal and the Cancer Imaging Archive (TCIA).

Figure 3: Examples of SVS images from the TCGA-BRCA dataset

4.2. Data Pre-processing

In this study, we undertake a comprehensive pre-processing procedure to prepare histopathology images for deep learning analysis. First, 200 SVS image files were collected from the TCGA-BRCA dataset, including images with BRCA1 or BRCA2 mutations, as well as some without these mutations. These images can be as large as 130,000 × 99,000 pixels. To facilitate efficient processing and analysis, the images were divided into smaller patches of 1400 × 920 pixels (see Figure 4 for examples). Each patch was then classified according to BRCA mutation status, distinguishing between BRCA mutation-positive and BRCA mutation-negative cases. This classification is critical for investigating the role of BRCA mutations in breast cancer.
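The per-slide labelling just described, where every patch inherits the BRCA status of its parent slide, can be sketched as follows. The slide identifiers and the mutation table here are hypothetical placeholders; the real labels come from TCGA-BRCA mutation data.

```python
# Hypothetical slide-level mutation table (real labels come from
# TCGA-BRCA somatic mutation profiles).
slide_mutation = {
    "slide_001": "BRCA1",
    "slide_002": "BRCA2",
    "slide_003": "NO_BRCA",
}

# Patches are identified by (slide_id, patch_index); each patch
# inherits the mutation status of its parent slide.
patches = [("slide_001", i) for i in range(3)] + \
          [("slide_003", i) for i in range(2)]

labeled = [(sid, idx, slide_mutation[sid]) for sid, idx in patches]

# Group patches by mutation status for the three categories of Step 2.
by_status = {}
for sid, idx, status in labeled:
    by_status.setdefault(status, []).append((sid, idx))

print({k: len(v) for k, v in by_status.items()})  # {'BRCA1': 3, 'NO_BRCA': 2}
```

Labelling at the slide level is what makes the per-patch labels noisy: not every patch of a mutated slide necessarily exhibits mutation-related morphology.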
The dataset was organized according to BRCA mutation status, ensuring a comprehensive range of examples for model training. Subsequently, the dataset was split into training and validation sets to prepare for model evaluation (see Table 2 for the distribution of images).

Figure 4: Example of generated patches from SVS images

Table 2
Dataset Statistics for BRCA Mutation Classification

Mutation Status  | Number of SVS Images | Number of Patches
BRCA1            | 53                   | 38,849
BRCA2            | 38                   | 25,526
No BRCA Mutation | 109                  | 56,000
Total            | 200                  | 120,375

4.3. Used Metrics and Loss Functions

In this study, we use a variety of metrics and loss functions to evaluate and optimize our deep learning models for breast cancer detection and diagnosis. This includes the VQ-VAE model, which employs a specialized loss function. Below, we outline the metrics and loss functions used.

4.4. Loss Functions

• Binary Cross-Entropy Loss: Applied for binary classification tasks, such as distinguishing between cancerous and normal patches. It measures the performance of a classification model with output probabilities between 0 and 1. The formula is:

    Loss_BCE = −(1/N) Σ_{i=1}^{N} [ y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i) ]   (2)

where y_i denotes the ground truth label and ŷ_i is the predicted probability.

• VQ-VAE Loss Function: The VQ-VAE model utilizes a specialized loss function that includes three key components:
– Reconstruction Loss: Measures how well the decoder reconstructs the input from the quantized representation. It ensures that the reconstructed image is similar to the original input image. The formula is:

    Loss_Recon = (1/N) Σ_{i=1}^{N} ‖x_i − x̂_i‖²   (3)

where x_i is the original image and x̂_i is the reconstructed image.

– Codebook Loss: Encourages the codebook vectors to move closer to the encoder output, ensuring that the quantization process effectively captures the data's structure. This component helps to learn a better representation by minimizing the distance between the encoder output and the codebook vectors.
The formula is:

    Loss_Codebook = (1/N) Σ_{i=1}^{N} ‖z_i − e_{q(z_i)}‖²   (4)

where z_i is the continuous latent vector and e_{q(z_i)} is the quantized vector.

– Commitment Loss: Penalizes the encoder for not committing to a specific codebook vector, promoting stability in the learned representations. This component helps to stabilize the learning process and ensure that the encoder uses the codebook vectors effectively. The formula is:

    Loss_Commitment = β · (1/N) Σ_{i=1}^{N} ‖z_i − e_{q(z_i)}‖²   (5)

where β is a hyperparameter that controls the weight of the commitment loss term.

4.5. Evaluation Metrics

• Accuracy: Measures the proportion of correctly classified patches out of the total number of patches:

    Accuracy = Number of Correct Predictions / Total Number of Predictions   (6)

• Precision and Recall: Precision measures the proportion of true positive predictions among all positive predictions, while Recall measures the proportion of true positive predictions among all actual positives:

    Precision = True Positives / (True Positives + False Positives)   (7)

    Recall = True Positives / (True Positives + False Negatives)   (8)

• F1 Score: The harmonic mean of Precision and Recall, providing a balanced measure of model performance:

    F1 Score = 2 · (Precision · Recall) / (Precision + Recall)   (9)

• Area Under the Receiver Operating Characteristic Curve (AUC-ROC): Measures the model's ability to distinguish between classes across all classification thresholds. A higher AUC value indicates better model performance. The AUC is the area under the Receiver Operating Characteristic (ROC) curve, i.e., the integral of the true positive rate (TPR) over the false positive rate (FPR):

    AUC = ∫₀¹ TPR d(FPR)   (10)

5. Results and discussion

This section analyzes and interprets the results obtained from our experiments.

5.1. Clustering Results

The clustering process using the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm [18] aimed to categorize patches into specific groups based on the presence or absence of BRCA mutations. The parameters for DBSCAN were set with eps = 8.7 and min_samples = 180000, guiding the clustering of the latent space representations.

Table 3
Results of DBSCAN Clustering

Cluster   | Total Number of Images | With BRCA Mutation | No BRCA Mutation
Cluster 1 | 91,104                 | 17,467             | 73,637
Cluster 2 | 32,420                 | 3,611              | 25,661

The goal of this clustering was to classify patches according to their BRCA mutation status. The clustering process identified two distinct categories of clusters. The first cluster contains patches from both images with BRCA mutations and images without these mutations. The second cluster, however, exclusively contains patches from images identified with BRCA mutations. This clustering approach enables a more focused separation, supporting targeted analysis and model training based on the presence or absence of the BRCA mutation.

5.2. BRCA Patch Classification

In this section, we present the results of classifying BRCA patches using four different deep learning models: VGG16 [19], ResNet [20], EfficientNet [21], and Inception V3 [22]. The classification task involves distinguishing between patches with BRCA mutations and those without. The dataset used for BRCA mutation classification consists of histopathological image patches, divided into three subsets: training, validation, and testing. Table 4 summarizes the distribution of labels across the training, validation, and test sets. The training set comprises a total of 57,152 samples, with 48,176 labeled as NO_BRCA and 8,976 as BRCA. The validation set includes 11,431 samples, of which 9,639 are labeled as NO_BRCA and 1,792 as BRCA. Finally, the test set contains 14,288 samples, with 12,044 labeled as NO_BRCA and 2,244 as BRCA.
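The evaluation metrics defined in Section 4.5 can be computed directly from confusion-matrix counts. A minimal sketch, using illustrative counts rather than our actual results:

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 (equations 6-9)
    from the four confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative counts only (not the paper's actual confusion matrix).
acc, prec, rec, f1 = classification_metrics(tp=95, fp=2, fn=5, tn=98)
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3))
# 0.965 0.979 0.95 0.964
```

Note that F1 can equivalently be written as 2·TP / (2·TP + FP + FN), which makes its insensitivity to true negatives explicit.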
The classification results for the detection of BRCA mutations in patches using four different models are presented in Table 5, which outlines key metrics such as accuracy, AUC, precision, recall, and F1 score for both the BRCA and No Mutation classes. These metrics provide a comprehensive evaluation of each model's performance, highlighting their ability to distinguish between patients with BRCA mutations and those without.

Table 4
Dataset Summary for BRCA Mutation Classification

Dataset    | Total Samples | NO_BRCA | BRCA
Training   | 64,384        | 55,408  | 8,976
Validation | 11,431        | 9,639   | 1,792
Test       | 44,555        | 42,011  | 2,500

Table 5
BRCA Patch Classification Results (per-class values given as BRCA / No Mutation)

Model        | Accuracy | AUC    | Precision | Recall     | F1-Score
EfficientNet | 98.81%   | 97.01% | 98% / 99% | 94% / 100% | 96% / 99%
VGG16        | 98.81%   | 97.17% | 98% / 99% | 95% / 100% | 96% / 99%
ResNet       | 98.71%   | 97.11% | 97% / 99% | 95% / 99%  | 96% / 99%
Inception V3 | 98.94%   | 97.36% | 98% / 99% | 95% / 100% | 97% / 99%

All models exhibited exceptional performance, with accuracy exceeding 98%. Inception V3 achieved the highest accuracy at 98.94%, closely followed by EfficientNet and VGG16, both at 98.81%, while ResNet achieved 98.71%. The Area Under the Curve (AUC) further supports these findings, with all models surpassing the 97% threshold, led by Inception V3 at 97.36%. Detailed precision, recall, and F1 scores reveal that Inception V3 consistently outperformed the other models across all metrics. For the BRCA class, Inception V3 achieved a precision of 98%, a recall of 95%, and an F1 score of 97%. For the No Mutation class, Inception V3 reached near-perfect performance, with a recall of 100%, a precision of 99%, and an F1-score of 99%. These results highlight Inception V3's balanced sensitivity (recall) and precision across both classes, making it a reliable model for BRCA mutation detection. Although all models show strong performance, Inception V3 stands out with the best overall metrics in accuracy, AUC, and F1 score.
EfficientNet and VGG16 shared similar results, achieving an accuracy of 98.81% and maintaining high precision and recall for both classes. ResNet, although slightly lower in performance compared to the other models, still achieved competitive results, with a precision of 97% for the BRCA class and high recall values. The consistently high performance of all models underscores the effectiveness of deep learning architectures for histopathological image classification. However, the slight edge of Inception V3 in both AUC and F1-score suggests that its architecture may be better suited for extracting subtle features related to BRCA mutations in histopathological images, possibly due to its ability to capture multi-scale features. To further analyze the classification performance, confusion matrices for the four models (EfficientNet, VGG16, ResNet, and Inception V3) are illustrated in Figure 5, showing the distribution of true positives, false positives, true negatives, and false negatives for both the BRCA and No Mutation classes.

Figure 5: Comparison of confusion matrices for the different models: (a) VGG, (b) ResNet, (c) EfficientNet, (d) InceptionV3.

5.3. SVS Image Classification Results

The classification performance of our model for detecting BRCA mutations in histopathological SVS images is summarized in Table 6. The model achieved strong metrics for both the BRCA Mutation and No Mutation classes. Specifically, precision, recall, and F1-score for both classes were balanced, indicating robust classification results.

Table 6
SVS Images Classification Results

Class         | Precision | Recall | F1-Score
BRCA Mutation | 90%       | 90%    | 90%
No Mutation   | 97%       | 97%    | 97%
Accuracy: 95%   AUC: 93.27%

As shown in Table 6, the model achieved an overall accuracy of 95%, with an AUC score of 93.27%.
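A common way to obtain a single slide-level decision from many patch-level predictions, shown here as an assumption since the exact aggregation rule is not specified above, is to average the patch probabilities and threshold the mean:

```python
import numpy as np

def slide_prediction(patch_probs, threshold=0.5):
    """Aggregate patch-level BRCA probabilities into one slide-level
    decision by averaging them (one common aggregation rule; the
    exact rule used in our pipeline may differ)."""
    return float(np.mean(patch_probs)) >= threshold

# Hypothetical patch probabilities for two slides.
mostly_mutated = [0.9, 0.8, 0.7, 0.4]
mostly_normal = [0.1, 0.2, 0.05, 0.3]
print(slide_prediction(mostly_mutated), slide_prediction(mostly_normal))  # True False
```

Averaging is robust to a few mislabeled patches; alternatives such as majority voting or taking the maximum probability trade off sensitivity against false positives.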
The high precision and recall for both classes demonstrate the effectiveness of our approach in detecting BRCA mutations, reducing the risk of false positives and false negatives.

5.4. Comparison with Related Work

Table 7 presents a comparison of our method with related work in the field of BRCA mutation detection from histopathological images. Our approach, combining VQ-VAE and DBSCAN with InceptionV3, outperformed previous studies, achieving the highest AUC of 93.27%.

Table 7
Comparison with Related Works

Reference                  | Dataset      | Year | Methods                                                    | Metrics
Tristan Lazard et al. [11] | TCGA         | 2022 | ResNet-18                                                  | AUC 71%
Xiaoxiao Wang et al. [10]  | JSPHCM, JSCH | 2021 | ResNet-18                                                  | AUC 79%
Kurian et al. [13]         | TCGA-BRCA    | 2023 | SimCLR                                                     | 81.34% accuracy
Valieris et al. [14]       | TCGA         | 2020 | ResNet-34                                                  | AUC 80%
Nam Nhut Phan et al. [12]  | TCGA-BRCA    | 2021 | 2-step ResNet50/101, VGG16, Xception                       | AUC 92%
Our Work                   | TCGA-BRCA    | 2024 | VQ-VAE, DBSCAN, VGG16, ResNet50, EfficientNet, InceptionV3 | AUC 93.27%

Our approach offers several distinct advantages over other methods for the detection of BRCA mutations, primarily due to the integration of advanced unsupervised learning and clustering techniques. Using the VQ-VAE model, we efficiently encode high-dimensional histopathological images into a compact latent space, allowing the extraction of critical features while preserving key image details. Unlike conventional methods, which may struggle to capture subtle variations in tissue morphology, our model's ability to reconstruct intricate patterns enhances the detection of relevant features. Moreover, the incorporation of DBSCAN for clustering within this latent space adds a significant layer of robustness, effectively grouping similar patterns and reducing noise. This method ensures that irrelevant or noisy data are filtered out, improving classification accuracy.

6. Conclusion

In this study, we present a pioneering deep learning framework designed to predict mutations of the BRCA gene in breast cancer using histopathological images. By integrating Vector Quantized-Variational Autoencoders (VQ-VAE) for effective feature extraction and employing DBSCAN for clustering, we have established a robust model that demonstrates superior accuracy in classifying cases as BRCA mutation-positive or negative. This innovative approach surpasses conventional methods and highlights the potential of artificial intelligence to automate complex diagnostic processes within medical imaging. Our future work will focus on improving data augmentation techniques to further enhance the model's accuracy in detecting BRCA mutations. By generating synthetic samples that capture the variability in the expression of the BRCA mutation, we aim to improve the robustness and generalization of our model. This will be particularly valuable for addressing imbalances in the dataset and improving the classification of rare mutation cases. In addition, our exploration will extend to investigating the roles of other genetic mutations in breast cancer.

Declaration on Generative AI

During the preparation of this work, the authors used ChatGPT for grammar and spelling checks, as well as paraphrasing. After utilizing this tool, the authors reviewed and edited the content as necessary, taking full responsibility for the final publication.

References

[1] F. Touazi, D. Gaceb, M. Chirane, S. Herzallah, Two-stage approach for semantic image segmentation of breast cancer: Deep learning and mass detection in mammographic images, in: IDDM, 2023, pp. 62–76.
[2] M. Khaled, F. Touazi, D. Gaceb, Improving breast cancer diagnosis in mammograms with progressive transfer learning and ensemble deep learning, Arabian Journal for Science and Engineering (2024).
[3] F. Touazi, D. Gaceb, N. Boudissa, S.
Assas, Enhancing breast mass cancer detection through hybrid vit-based image segmentation model, in: The 6th Conference on Computing Systems and Applications, Algiers, Algeria, 2024, pp. 1–10. [4] R. A. Dar, M. Rasool, A. Assad, et al., Breast cancer detection using deep learning: Datasets, methods, and challenges ahead, Computers in biology and medicine 149 (2022) 106073. [5] N. Petrucelli, M. B. Daly, T. Pal, Brca1-and brca2-associated hereditary breast and ovarian cancer (2022). [6] A. Hodgson, G. Turashvili, Pathology of hereditary breast and ovarian cancer, Frontiers in Oncology 10 (2020) 531790. [7] V. Talwar, A. Rauthan, Brca mutations: implications of genetic testing in ovarian cancer, Indian Journal of Cancer 59 (2022) S56–S67. [8] L. F. Madrigal, M. Y. R. Garcés, F. J. J. Ruiz, Impact of non-brca genes in the indication of risk- reducing surgery in hereditary breast and ovarian cancer syndrome (hboc), Current Problems in Cancer 47 (2023) 101008. [9] S. Zhao, C.-Y. Yan, H. Lv, J.-C. Yang, C. You, Z.-A. Li, D. Ma, Y. Xiao, J. Hu, W.-T. Yang, et al., Deep learning framework for comprehensive molecular and prognostic stratifications of triple-negative breast cancer, Fundamental Research (2022). [10] X. Wang, C. Zou, Y. Zhang, X. Li, C. Wang, F. Ke, J. Chen, W. Wang, D. Wang, X. Xu, et al., Prediction of brca gene mutation in breast cancer based on deep learning and histopathology images, Frontiers in Genetics 12 (2021) 661109. [11] T. Lazard, G. Bataillon, P. Naylor, T. Popova, F.-C. Bidard, D. Stoppa-Lyonnet, M.-H. Stern, E. De- cencière, T. Walter, A. Vincent-Salomon, Deep learning identifies morphological patterns of homologous recombination deficiency in luminal breast cancers from whole slide images, Cell Reports Medicine 3 (2022). [12] N. N. Phan, C.-C. Huang, L.-M. Tseng, E. Y. 
Chuang, Predicting breast cancer gene expression signature by applying deep convolutional neural networks from unannotated pathological images, Frontiers in oncology 11 (2021) 769447. [13] N. C. Kurian, S. Varsha, A. Patil, S. Khade, A. Sethi, Robust semi-supervised learning for histopathol- ogy images through self-supervision guided out-of-distribution scoring, in: 2023 IEEE 23rd International Conference on Bioinformatics and Bioengineering (BIBE), IEEE, 2023, pp. 121–128. [14] R. Valieris, L. Amaro, C. A. B. d. T. Osório, A. P. Bueno, R. A. Rosales Mitrowsky, D. M. Carraro, D. N. Nunes, E. Dias-Neto, I. T. d. Silva, Deep learning predicts underlying features on pathology images with therapeutic relevance for breast and gastric cancer, Cancers 12 (2020) 3687. [15] A. van den Oord, O. Vinyals, K. Kavukcuoglu, Neural discrete representation learning, CoRR abs/1711.00937 (2017). URL: http://arxiv.org/abs/1711.00937. arXiv:1711.00937. [16] A. Van Den Oord, O. Vinyals, et al., Neural discrete representation learning, Advances in neural information processing systems 30 (2017). [17] A. Thennavan, F. Beca, Y. Xia, S. Garcia-Recio, K. Allison, L. C. Collins, M. T. Gary, Y.-Y. Chen, S. J. Schnitt, K. A. Hoadley, et al., Molecular analysis of tcga breast cancer histologic types, Cell genomics 1 (2021). [18] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, et al., A density-based algorithm for discovering clusters in large spatial databases with noise, in: kdd, volume 96, 1996, pp. 226–231. [19] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556 (2014). [20] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Conference on computer vision and pattern recognition, 2016, pp. 770–778. [21] M. Tan, Q. Le, Efficientnet: Rethinking model scaling for convolutional neural networks, in: International conference on machine learning, PMLR, 2019, pp. 6105–6114. [22] C. Szegedy, V. 
Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception architecture for computer vision, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2818–2826.