Two-Stage Approach for Semantic Image Segmentation of Breast Cancer: Deep Learning and Mass Detection in Mammographic Images

Faycal Touazi, Djamel Gaceb, Marouane Chirane, Selma Herzallah
LIMOSE Laboratory, Computer Science Department, University M'hamed Bougara, Independence Avenue, 35000 Boumerdes, Algeria

IDDM'2023: 6th International Conference on Informatics & Data-Driven Medicine, November 17-19, 2023, Bratislava, Slovakia
EMAIL: f.touazi@univ-boumerdez.dz (A. 1); d.gaceb@univ-boumerdez.dz (A. 2); ch.marouanee@gmail.com (A. 3); harzallahselma@gmail.com (A. 4)
ORCID: 0000-0001-5949-5421 (A. 1)

Abstract
Breast cancer is a significant global health problem that predominantly affects women and requires effective screening methods. Mammography, the primary screening approach, presents challenges such as radiologist workload and associated costs. Recent advances in deep learning hold promise for improving breast cancer diagnosis. This paper focuses on early breast cancer detection using deep learning to assist radiologists and to reduce their workload and costs. We employed the CBIS-DDSM dataset and various CNN models, including YOLO versions V5, V7, and V8 for mass detection, and transformer-based (nested) models inspired by ViT for mass segmentation. This diverse approach aims to address the complexity of breast cancer detection and segmentation in medical images. Our results are promising, with a 59% mAP50 for cancer mass detection and a 90.15% Dice coefficient for semantic segmentation. These findings highlight the potential of deep learning to enhance breast cancer diagnosis, paving the way for more efficient and accurate early detection methods.

Keywords
Breast Cancer, Deep Learning, ViT, NesT, YOLO

1. Introduction

Breast cancer remains one of the most prevalent diseases among women globally and stands as a leading cause of mortality among gynecological cancers. Across the world, the situation is dire, with one in ten women affected by this disease during their lifetime. It ranks second in overall cancer incidence, following prostate cancer, affecting individuals of all genders. Despite considerable efforts in the form of screening programs aimed at prevention and early detection, there is an urgent need to improve methods for analyzing mammography images.

Mammography remains the unquestionable gold standard for breast exploration, offering unmatched performance in breast cancer surveillance and early detection. Each year, millions of mammograms are produced worldwide for the early screening of breast cancer or to establish a diagnosis that guides therapeutic interventions. However, the interpretation of these images remains a major challenge for healthcare professionals, as they carry complex radiological information that is difficult to exploit fully through human expertise, which relies on visual interpretation and experience. Confronted with this challenge, the development of dedicated software for mammography image analysis becomes imperative to optimize the use of these images for the benefit of both patients and physicians. A more suitable method of interpretation is required to enable earlier detection and more effective management of the disease.
Deep Learning (DL) has revolutionized various real-world domains by providing accurate and powerful solutions. In the medical field, it also offers promising solutions for the interpretation of medical images, allowing for highly precise analysis. This paper focuses on applying Deep Learning, using a range of models and techniques including transformers, with the goal of detecting breast cancer in mammograms. To achieve this objective, we integrate the YOLO (You Only Look Once) model [1] for precise detection of regions of interest (ROI) in mammographic images. Once the regions of interest are identified, we employ SegNesT, an adaptation of the ViT NesT model [2], to perform semantic segmentation of tumors.

In this paper, our primary objective is to push the boundaries of early breast cancer detection by harnessing advances in Deep Learning and computer vision. To achieve this goal, we apply object detection and semantic segmentation techniques to effectively localize and characterize breast cancer masses in real mammography images. Moreover, we explore a hybrid approach that combines transformers (ViT) with CNNs to leverage their respective strengths in breast cancer detection.

The paper is structured as follows: Section 2 reviews related work in the field, providing the context and significance of our research. Section 3 outlines our proposed approach, describing the methodology and techniques employed in our study. Section 4 presents our results together with an in-depth discussion, offering insights and interpretations of the data. Finally, Section 5 draws our conclusions, summarizing the key findings, their implications, and potential avenues for future research.

2. Related works

In this section, we present an overview of recent studies in the field of breast cancer detection and tumor segmentation using deep learning techniques. For breast cancer detection, the authors of [3] proposed a two-step method using high-resolution mammograms. They achieved a significant improvement over Faster R-CNN in terms of detection accuracy for BI-RADS categories. Aly et al. [4] applied YOLO-V3 for automated breast mass detection, achieving a mass detection rate of 89.4% and high precision for classifying malignant and benign masses. Prinzi et al. [5] presented an approach to automated breast cancer detection using the YOLOv5 architecture, which reached an mAP50 of 49.8% on the CBIS-DDSM dataset.

For breast tumor segmentation, Soltani et al. [6] employed Mask R-CNN, reporting promising performance with a precision of 0.75, a recall of 0.80, and an F1 score of 0.825. Yu et al. [7] introduced Dense-Mask R-CNN, which surpassed the original Mask R-CNN in breast mass detection on the CBIS-DDSM dataset, with an average precision (AP) of 0.65. Among the transformer-based approaches, we cite the work of Liu et al. [8], who introduced TrEnD, a transformer-based encoder-decoder model for mammography mass segmentation. They applied superpixel-based adaptive patch embedding and achieved improved Dice and Intersection over Union (IoU) scores on the CBIS-DDSM and INBreast datasets. Su et al. [9] developed the YOLO-LOGO model for breast mass detection and segmentation in digital mammograms.
Their model effectively combined mass detection and segmentation using YOLOv5L6 and a Vision Transformer (ViT), showing promising results that outperformed other segmentation models. Trained on the CBIS-DDSM dataset, it achieved a Dice score of 84.49%. Prinzi et al. [5] proposed an approach for breast cancer detection in CBIS-DDSM mammograms. The study compares several YOLO architectures, namely YOLO V3, YOLO V5 and a YOLOv5-Transformer. In the latter architecture, the Transformer block was incorporated into the second-to-last layer of the backbone network, specifically positioned among the three convolutional layers that precede the spatial pyramid pooling layer. The small YOLOv5 model outperformed the others with an mAP of 0.621.

In summary, the hybridization of YOLO with Vision Transformers (ViT) represents a promising avenue for breast cancer detection and tumor segmentation, as evidenced by the compelling results reported in the existing literature. This fusion of YOLO and ViT architectures has consistently demonstrated superior performance across studies, underscoring its potential to enhance both mass detection and segmentation tasks.

3. Proposed approach

In this section, we describe our proposed deep learning approach for the detection and diagnosis of breast cancer. The approach focuses specifically on the detection of breast masses within the context of breast cancer. It is a holistic approach that combines a mass detection stage based on the YOLO architecture and a segmentation stage based on the SegNesT architecture. By integrating these two stages (see Figure 1), the approach combines complementary aspects of deep learning to create a more complete and potentially more effective diagnostic system for patients.

Figure 1: The framework of the proposed approach. (a) Original mammogram of the mass, (b) Detected Region of Interest (ROI) of the mass, (c) ROI of the detected mass, (d) Binary mask segmented from the ROI of the mass

3.1. Breast mass cancer detection based on the YOLO model

At this stage, a comprehensive comparative study of common object detection methods was conducted. Among the various approaches examined, YOLO [1] emerged as a promising choice due to its advanced real-time object detection performance. In the first phase of the proposed approach for breast cancer detection (detection of the region of interest, ROI, of the mass), the three most recent versions of the YOLO architecture, V5 [10], V7 [11] and V8 [12], are used and compared (see Figures 2, 3 and 4 for architecture details).

Figure 2: YOLO V5 architecture [13]
Figure 3: YOLO V7 architecture [14]
Figure 4: YOLO V8 architecture [15]

3.2. Breast cancer segmentation based on the SegNesT architecture

Once the regions of interest have been identified in the first stage, we apply, in the second stage, image segmentation using the SegNesT model. This model is a customized version of ViT NesT [2], adapted specifically for segmentation tasks. SegNesT excels at precisely outlining the contours of relevant structures, thereby enhancing lesion characterization. The architecture adopts a hierarchical approach based on the Transformer architecture for image processing. The workflow starts with data pre-processing, where an image and its corresponding mask (label) are fed into the model. The image is initially partitioned into patches, which facilitates the capture of local details while retaining a global image representation and accommodating different resolutions.
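To make this partitioning step concrete, the following is a minimal PyTorch sketch of patchify/unpatchify operations of the kind used here; the function names, patch size and shapes are illustrative assumptions rather than our exact implementation.

```python
import torch

def patchify(images: torch.Tensor, patch: int = 16) -> torch.Tensor:
    """Split a batch of images (B, C, H, W) into flattened non-overlapping patches (B, N, C*patch*patch)."""
    b, c, h, w = images.shape
    x = images.reshape(b, c, h // patch, patch, w // patch, patch)
    x = x.permute(0, 2, 4, 1, 3, 5)             # (B, H/p, W/p, C, p, p)
    return x.reshape(b, -1, c * patch * patch)  # (B, N, C*p*p)

def unpatchify(patches: torch.Tensor, patch: int = 16, channels: int = 1, size: int = 224) -> torch.Tensor:
    """Inverse of patchify: rebuild (B, C, size, size) images from flattened patches."""
    b, n, _ = patches.shape
    g = size // patch                           # number of patches per side
    x = patches.reshape(b, g, g, channels, patch, patch)
    x = x.permute(0, 3, 1, 4, 2, 5)             # (B, C, H/p, p, W/p, p)
    return x.reshape(b, channels, size, size)

# Round trip on a dummy 224x224 grayscale batch: the two operations are exact inverses
x = torch.rand(2, 1, 224, 224)
assert torch.equal(unpatchify(patchify(x)), x)
```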
Subsequently, the model employs multiple hierarchical NesT levels to capture information across various scales. Each hierarchical level comprises a pooling layer, a convolutional layer with normalization, a position embedding layer and a transformer layer to model feature dependencies. Ultimately, the model can represent intricate information at multiple resolutions. Finally, it employs a deprojection operation (un-patchify) to reconstruct the image (mask) from the extracted features (see Figure 5).

Figure 5: SegNesT architecture

Our architectural design encompasses three primary components:
• NesT (Nested Transformers): The NesT ViT architecture encompasses five crucial components for comprehensive image analysis. First, it starts with Patch Embedding, dividing the input image into smaller patches and transforming them into embeddings, which facilitates the processing of both local and global information. The architecture then operates across multiple Hierarchical Levels, focusing on feature extraction at various scales and using self-attention mechanisms and feed-forward networks to enhance feature representations. To maintain spatial awareness, Positional Embeddings are added to the patch representations. Following feature extraction, a Feature Refinement stage refines the feature maps using convolutional layers, effectively eliminating artifacts and enhancing visual quality through the following steps:
  • Linear Layer
  • Unpatchify Layer
  • Convolution Layer 1: kernel size = 9×9
  • Convolution Layer 2: kernel size = 5×5
  • Convolution Layer 3: kernel size = 3×3
  • MaxPooling Layer: kernel size = 3×3
Finally, an Image Reconstruction stage rearranges the feature representations into the original image format, ensuring a coherent and visually appealing final output.
• Mask Reconstruction: The second component focuses on the reconstruction of the image itself. It takes the embedding vectors generated by the NesT part and arranges them in a grid of patches to reconstruct the segmented image or mask.
• CNN Block: The third and essential component is a CNN block that contributes significantly to the overall architecture. This block includes three convolution layers:
  • Convolution Layer 1: kernel size = 9×9
  • LeakyReLU Layer
  • Convolution Layer 2: kernel size = 5×5
  • LeakyReLU Layer
  • Convolution Layer 3: kernel size = 3×3
Padding is applied in these three layers to maintain the image size after convolution. This CNN block effectively eliminates any blocking artifacts that might be present in the image reconstructed by the NesT component, resulting in a smoother and visually more appealing final output.

4. Experimentations and results

4.1. Dataset used in this work

In this work we have chosen the CBIS-DDSM dataset [16], a subset of the Digital Database for Screening Mammography (DDSM) [17] and a valuable resource for breast cancer research. It stands out due to its complexity, encompassing diverse digital mammography images of both normal and abnormal cases. These images are rich in details and annotations, making the dataset challenging for tasks such as lesion detection and classification. The dataset's complexity arises from the presence of subtle lesions, varying image quality, and diverse lesion types. We used 1253 images for the training set and 363 for the test set.

Figure 6: Examples of CBIS-DDSM images
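For illustration, the following minimal sketch shows one way such image/mask pairs could be gathered for training and testing; the folder layout, file naming and use of OpenCV are assumptions for the example and do not reflect the exact organization of the released dataset.

```python
from pathlib import Path
import cv2  # OpenCV, assumed available

def load_pairs(root: str):
    """Collect (image, mask) arrays from a hypothetical layout: root/images/<case>.png and root/masks/<case>.png."""
    root = Path(root)
    pairs = []
    for img_path in sorted((root / "images").glob("*.png")):
        mask_path = root / "masks" / img_path.name
        if mask_path.exists():
            image = cv2.imread(str(img_path), cv2.IMREAD_GRAYSCALE)
            mask = cv2.imread(str(mask_path), cv2.IMREAD_GRAYSCALE)
            pairs.append((image, mask))
    return pairs

# Hypothetical usage mirroring the 1253 / 363 train-test split used in this work
train_pairs = load_pairs("CBIS-DDSM/train")
test_pairs = load_pairs("CBIS-DDSM/test")
```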
4.2. Data pre-processing

As part of our implementation, we applied several preprocessing techniques to prepare the dataset for use. We summarize them in the following points:

• Image cropping: We applied image cropping to focus on a specific region of interest (ROI) within each image. Our approach uses the mask images supplied with the dataset, allowing us to extract and crop the white regions from the original images.
• Resizing: We resized all cropped images to 224×224 px to fit the model input.
• Image enhancement using the CLAHE method: This technique improves the visibility of details in an image by enhancing its contrast. It redistributes intensity values so as to ensure a more uniform distribution of pixel values, thereby making both dark and bright regions more distinguishable.

Figure 7: Examples of applying CLAHE: (a) input image, (b) CLAHE result.

• Normalization: The normalization process consists of subtracting the minimum value from each pixel and then dividing by the range of pixel values (maximum minus minimum). This scaling ensures that pixel values lie within the desired range, which improves performance, avoids numerical instabilities and allows consistent comparisons between pixel values. We applied this normalization to the input images by scaling their pixel values to the range 0 to 1.
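This preprocessing chain can be sketched with OpenCV as follows; the CLAHE parameters and the helper name are illustrative choices, not necessarily the exact settings used in our experiments.

```python
import cv2
import numpy as np

def preprocess(image: np.ndarray, mask: np.ndarray, size: int = 224) -> np.ndarray:
    """Crop the ROI given by the binary mask, resize, apply CLAHE and normalize to [0, 1]."""
    # 1. Crop the bounding box of the white (non-zero) region of the mask
    ys, xs = np.nonzero(mask)
    roi = image[ys.min():ys.max() + 1, xs.min():xs.max() + 1]

    # 2. Resize the crop to the model input size
    roi = cv2.resize(roi, (size, size))

    # 3. Contrast enhancement with CLAHE (illustrative parameters)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    roi = clahe.apply(roi.astype(np.uint8))

    # 4. Min-max normalization to the [0, 1] range
    roi = roi.astype(np.float32)
    return (roi - roi.min()) / (roi.max() - roi.min() + 1e-8)
```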
4.3. Metrics and loss functions used

Here we present the different metrics and loss functions used to train and evaluate our models.

Intersection over Union (IoU): an evaluation metric used to assess the quality of segmentations. It is calculated as the ratio of the intersection area between the predicted mask and the reference mask to the union area of the two masks:

IoU = |A ∩ B| / |A ∪ B|   (1)

Dice score (DSC): an evaluation measure used to assess the similarity between two sets, often used to assess the quality of medical image segmentations. For two sets A and B, the Dice score is calculated as follows:

DSC = 2 |A ∩ B| / (|A| + |B|)   (2)

where |A| and |B| denote the cardinalities of sets A and B respectively.

Dice loss: a loss function used for training segmentation models, especially for tasks where the segmentation is represented by a binary mask. The Dice loss is defined as the complement of the Dice score; the objective is to minimize this loss to improve the quality of the segmentation:

L_Dice = 1 − DSC   (3)

Binary cross-entropy (BCE): a commonly used loss function for training binary classification models, used when each example can belong to only one class. It measures the distance between the model's predictions and the true labels (ground truth):

L_BCE = −(y · log(p) + (1 − y) · log(1 − p))   (4)

where y is the true binary label (0 or 1) and p is the probability predicted by the model for this label.

Focal binary cross-entropy (Focal loss): a specialized variant of the binary cross-entropy loss. The focal loss was introduced to address the problem of training deep neural networks on imbalanced datasets, where the model may struggle to learn effectively from minority-class examples. The main idea behind the focal binary cross-entropy is to down-weight the loss contribution of easy-to-classify examples and focus more on the hard-to-classify ones. It does this by introducing two key parameters, γ and α:

L_BCE^Focal = −α_t (1 − p_t)^γ log(p_t)   (5)

where

p_t = p if y = 1, and p_t = 1 − p otherwise   (6)
α_t = α if y = 1, and α_t = 1 − α otherwise   (7)

Combined loss function (Combo loss): a composite loss function that simultaneously minimizes the Dice loss and a modified version of the cross-entropy loss. Its formula is given as follows:

L_comb = α · L_BCE + (1 − α) · L_Dice   (8)

We can also consider a focal version of this combined loss function:

L_comb^Focal = α · L_BCE^Focal + (1 − α) · L_Dice   (9)

Mean Average Precision (mAP): commonly used to analyze the performance of object detection and segmentation systems; in our work it is used to evaluate the mass detection models. It compares the ground-truth bounding box to the detected box; the higher the score, the more accurate the model's detections. It is calculated using the following formula:

mAP = (1/n) Σ_{k=1..n} AP_k   (10)

where n is the number of classes and AP_k is the average precision of class k:

AP_k = TP(k) / (TP(k) + FP(k))   (11)

where TP(k) counts the true positives of class k (the model predicted a label of class k that correctly matches the ground truth) and FP(k) counts the false positives of class k (the model predicted a label of class k that is not part of the ground truth). mAP50 is the mean average precision computed at an IoU threshold of 0.5.
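To illustrate, equations (3), (5) and the weighted combination of (9) can be sketched in PyTorch as follows; the α, γ and weight values shown are generic defaults, and the function names are illustrative rather than our exact implementation.

```python
import torch
import torch.nn.functional as F

def dice_loss(probs: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Soft Dice loss = 1 - DSC, computed on predicted probabilities and a binary mask (Eq. 3)."""
    inter = (probs * target).sum()
    return 1.0 - (2.0 * inter + eps) / (probs.sum() + target.sum() + eps)

def focal_bce_loss(logits: torch.Tensor, target: torch.Tensor,
                   alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Focal BCE (Eq. 5): down-weights easy examples via the (1 - p_t)^gamma factor."""
    bce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")  # = -log(p_t)
    p = torch.sigmoid(logits)
    p_t = torch.where(target == 1, p, 1 - p)
    alpha_t = torch.where(target == 1, torch.full_like(p, alpha), torch.full_like(p, 1 - alpha))
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

def combined_focal_loss(logits: torch.Tensor, target: torch.Tensor,
                        dice_weight: float = 0.5, focal_weight: float = 0.5) -> torch.Tensor:
    """Weighted sum of the focal BCE term and the Dice term, as used in our experiments."""
    probs = torch.sigmoid(logits)
    return focal_weight * focal_bce_loss(logits, target) + dice_weight * dice_loss(probs, target)
```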
4.4. Results and discussion

This section is dedicated to the analysis and discussion of the outcomes of our experiments, focusing on two main aspects: mass detection results and segmentation results.

4.4.1. Mass detection results

In this subsection, we examine the performance of our object detection models. Table 1 presents the performance of the different models: YOLOv5 Small (V5 S), YOLOv7 (V7 X), and YOLOv8 Medium (V8 M). On the test set, YOLOv8 Medium achieved the best mAP50, with a score of 59%, surpassing YOLOv5 Small (46%) and YOLOv7 X (51%). These results suggest that YOLOv8 Medium is the best-performing of the three tested models.

Table 1
Mass detection results using different YOLO models

Model        mAP50   Image size
YOLO V5 S    46%     1280×1280
YOLO V7 X    51%     640×640
YOLO V8 M    59%     640×640

It is worth noting that YOLOv5 Medium, despite being larger than YOLOv5 Small and using an image resolution of 640×640, did not achieve satisfactory results. This can be attributed to the fact that YOLOv5s was trained on high-resolution images (1280×1280), whereas YOLOv5m was trained on lower-resolution images. Unfortunately, it was not possible to evaluate YOLOv5m at a resolution of 1280×1280 due to hardware limitations; therefore, its results at this resolution are not considered in this comparison.

Table 2 compares our detection results with those of related works.

Table 2
Comparison with related works

Paper               Model        mAP50
Prinzi et al. [5]   YOLO V5s     49.8%
Su et al. [9]       YOLO V5s     59%
Ours                YOLO V8      59%
Su et al. [9]       YOLO V5L6    65%

In the study by Su et al. [9], a success rate of 59% was achieved with YOLOv5 after 1000 training epochs. Our approach also reached a 59% mAP, but required only 300 training epochs, which suggests a relative efficiency of our approach. With YOLOv5L6 they achieved an mAP50 of 65%; however, due to hardware constraints, we were unable to use this version. Furthermore, compared to the work of Prinzi et al. [5], which utilized YOLOv5s with data augmentation and obtained 49.8% mAP50, our approach yielded better results.

4.4.2. Segmentation results

This subsection is dedicated to the analysis of our segmentation model's performance. The results of SegNesT with the different loss functions used are displayed in Table 3:

Table 3
Mass segmentation results using the SegNesT model

Loss function                                            Dice
L_BCE                                                    75%
L_comb                                                   81.2%
L_comb^Focal (dice_weight = 1.0, focal_weight = 1.0)     89.99%
L_comb^Focal (dice_weight = 0.5, focal_weight = 0.5)     90.15%

In our research, we explored different loss functions for training our SegNesT model. Initially, using the binary cross-entropy (BCE) loss function, the model achieved a Dice score of 75%. When we adopted the combined loss function, performance improved significantly, reaching a Dice score of 81.2%. To further enhance the model's performance, we introduced the combined focal loss with dice_weight = 1.0 and focal_weight = 1.0, which resulted in even better performance, with a Dice score of 88.99%. Finally, by adjusting the weights to dice_weight = 0.5 and focal_weight = 0.5, the model achieved its best result, with a remarkable Dice score of 90.15%. These results underscore the critical importance of selecting the right loss function to improve the performance of our SegNesT model.

Table 4 presents a comparison of various models and their performance metrics in the context of breast cancer mass segmentation.

Table 4
Comparison with related works

Paper                            Model                  Dice
Sharif Amit Kamran et al. [23]   Swin-SFTNet            24.13%
Bouzar-Benlabiod et al. [18]     U-Net SE-ResNet-101    75%
Yuehang Wang et al. [19]         AM-MSP-cGAN            84.49%
Dongdong Liu et al. [8]          TrEnD                  89.48%
Our approach                     SegNesT                90.15%

Our SegNesT model achieved the highest Dice score among similar works in the literature. Bouzar-Benlabiod et al. [18], using an Attention U-Net, obtained a Dice score of 75%. Yuehang Wang et al. [19] achieved a Dice score of 84.49% with their AM-MSP-cGAN model. Dongdong Liu et al. [8], with a ViT-based model named TrEnD, achieved a Dice score of 89.48%. It is clear that our SegNesT model outperformed the related works, including transformer-based models, achieving the highest Dice score of 90.15%. This superior performance highlights the effectiveness of our approach compared to previous methods in the field of medical image segmentation.

Our NesT-based approach plays a crucial role in addressing the quadratic complexity of full self-attention in vision transformers. By introducing a hierarchical nested structure and incorporating block aggregation, NesT effectively improves data efficiency and accuracy compared to previous methods within the realm of ViT-based approaches. This progress positions our approach favorably compared to other ViT-based approaches. The block aggregation mechanism plays a central role in promoting effective inter-block communication, thereby diminishing the need for full self-attention at each layer. This simplification of the architectural design not only amplifies the effectiveness of training on smaller datasets but also remains useful as model size scales up, illustrating NesT's enhanced efficiency in handling larger models.

4.4.3. Result samples

In this section, we provide a showcase of result samples obtained from our study. These samples serve as illustrative examples of the outcomes generated by our research.
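Each sample is produced by chaining the two stages: the detector proposes mass ROIs and the segmenter predicts a binary mask for each crop. The sketch below outlines this flow, assuming the Ultralytics YOLO API for detection and a saved SegNesT model with a standard PyTorch forward pass; the weight file names and input conventions are placeholders, not our released code.

```python
import cv2
import torch
from ultralytics import YOLO  # Ultralytics YOLOv8 package

detector = YOLO("yolov8m_masses.pt")                       # placeholder: fine-tuned detector weights
segmenter = torch.load("segnest.pt", weights_only=False)   # placeholder: trained SegNesT model
segmenter.eval()

image = cv2.imread("mammogram.png")                        # BGR image, as expected by the detector
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Stage 1: detect candidate mass regions (bounding boxes in pixel coordinates)
boxes = detector(image)[0].boxes.xyxy.cpu().numpy()

for x1, y1, x2, y2 in boxes.astype(int):
    # Stage 2: crop the ROI, resize to the segmentation input size and predict a binary mask
    roi = cv2.resize(gray[y1:y2, x1:x2], (224, 224)).astype("float32") / 255.0
    inp = torch.from_numpy(roi)[None, None]                # shape (1, 1, 224, 224)
    with torch.no_grad():
        mask = (torch.sigmoid(segmenter(inp)) > 0.5).squeeze().to(torch.uint8).numpy()
```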
Figure 8 showcases qualitative results of our model's detection task, offering a visual representation of its performance in identifying and localizing objects of interest within the dataset. These results provide valuable insights into the accuracy and precision of the model's detection capabilities, contributing to a comprehensive assessment of its overall effectiveness. Figure 9 presents qualitative results of our model's segmentation task, underlining the strong resemblance between the ground-truth mask and the predicted mask generated by our SegNesT model. This similarity confirms the precision and fidelity of the model's segmentation capabilities.

Figure 8: Masses detected with YOLOv8

Figure 9: Example of a segmentation result with SegNesT. (a) Region of Interest (ROI), (b) Ground Truth, (c) Predicted Mask

5. Conclusion

In this paper, we highlight the potential of object detection and semantic segmentation in the field of breast cancer detection and diagnosis. We explored various aspects of deep learning, including mass detection using YOLO versions 5, 7, and 8, as well as breast mass segmentation using our proposed SegNesT architecture, based on ViT NesT. The findings indicate the efficacy of these methodologies: YOLO V8 M achieved the highest mean average precision (mAP) of 59% among the YOLO models for mass detection, and our SegNesT model demonstrated outstanding performance in mass semantic segmentation, achieving a Dice score of 90.15%. These approaches have demonstrated their effectiveness in identifying anomalies and tumors in mammographic images, offering promising avenues for improving the accuracy of breast cancer diagnoses.

While our findings are promising, it is important to note that our experimentation was conducted with a limited dataset. To enhance the performance and generalizability of our models, we foresee numerous directions for future research and development. These include expanding our dataset to encompass a more diverse range of cases, refining model architectures, and exploring transfer learning from other medical imaging domains. These steps will be crucial in ensuring that the benefits of deep learning in breast cancer detection can be realized more broadly, ultimately benefiting both patients and healthcare professionals.

6. References

[1] J. Redmon, S. Divvala, R. Girshick and A. Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[2] Z. Zhang, H. Zhang, L. Zhao, T. Chen, S. Ö. Arik and T. Pfister, "Nested hierarchical transformer: Towards accurate, data-efficient and interpretable visual understanding," in Proceedings of the AAAI Conference on Artificial Intelligence, 2022.
[3] B. Ibrokhimov and J.-Y. Kang, "Two-stage deep learning method for breast cancer detection using high-resolution mammogram images," Applied Sciences, vol. 12, p. 4616, 2022.
[4] G. H. Aly, M. Marey, S. A. El-Sayed and M. F. Tolba, "YOLO based breast masses detection and classification in full-field digital mammograms," Computer Methods and Programs in Biomedicine, vol. 200, p. 105823, 2021.
[5] F. Prinzi, M. Insalaco, A. Orlando, S. Gaglio and S. Vitabile, "A Yolo-based model for breast cancer detection in mammograms," Cognitive Computation, pp. 1–14, 2023.
[6] H. Soltani, M. Amroune, I. Bendib and M. Y. Haouam, "Breast cancer lesion detection and segmentation based on Mask R-CNN," in 2021 International Conference on Recent Advances in Mathematics and Informatics (ICRAMI), 2021.
[7] H. Yu, R. Bai, J. An and R. Cao, "Deep learning-based fully automated detection and segmentation of breast mass," in 2020 13th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), 2020.
[8] D. Liu, B. Wu, C. Li, Z. Sun and N. Zhang, "TrEnD: A transformer-based encoder-decoder model with adaptive patch embedding for mass segmentation in mammograms," Medical Physics, vol. 50, pp. 2884–2899, 2023.
[9] Y. Su, Q. Liu, W. Xie and P. Hu, "YOLO-LOGO: A transformer-based YOLO segmentation model for breast mass detection and segmentation in digital mammograms," Computer Methods and Programs in Biomedicine, vol. 221, p. 106903, 2022.
[10] G. Jocher, "YOLOv5 by Ultralytics," release date 5-29, 2020.
[11] C.-Y. Wang, A. Bochkovskiy and H.-Y. M. Liao, "YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
[12] G. Jocher, A. Chaurasia and J. Qiu, "YOLO by Ultralytics," URL: https://github.com/ultralytics/ultralytics, 2023.
[13] D. Dluznevskij, P. Stefanovič and S. Ramanauskaite, "Investigation of YOLOv5 Efficiency in iPhone Supported Systems," Baltic Journal of Modern Computing, vol. 9, 2021.
[14] S. Zhou, K. Cai, Y. Feng, X. Tang, H. Pang, J. He and X. Shi, "An Accurate Detection Model of Takifugu rubripes Using an Improved YOLO-V7 Network," Journal of Marine Science and Engineering, vol. 11, p. 1051, 2023.
[15] Ultralytics, "GitHub Issue 189 - Ultralytics," 2023. [Online]. Available: https://github.com/ultralytics/ultralytics/issues/189.
[16] R. S. Lee, F. Gimenez, A. Hoogi, K. K. Miyake, M. Gorovoy and D. L. Rubin, "A curated mammography data set for use in computer-aided detection and diagnosis research," Scientific Data, vol. 4, pp. 1–9, 2017.
[17] M. Heath, K. Bowyer, D. Kopans, P. Kegelmeyer Jr, R. Moore, K. Chang and S. Munishkumaran, "Current status of the digital database for screening mammography," in Digital Mammography: Nijmegen, 1998, Springer, 1998, pp. 457–460.
[18] L. Bouzar-Benlabiod, K. Harrar, L. Yamoun, M. Y. Khodja and M. A. Akhloufi, "A novel breast cancer detection architecture based on a CNN-CBR system for mammogram classification," Computers in Biology and Medicine, vol. 163, p. 107133, 2023.
[19] Y. Wang, S. Wang, J. Chen and C. Wu, "Whole mammographic mass segmentation using attention mechanism and multiscale pooling adversarial network," Journal of Medical Imaging, vol. 7, p. 054503, 2020.
[20] G. Ayana, K. Dese, Y. Dereje, Y. Kebede, H. Barki, D. Amdissa, N. Husen, F. Mulugeta, B. Habtamu and S.-w. Choe, "Vision-Transformer-Based Transfer Learning for Mammogram Classification," Diagnostics, vol. 13, p. 178, 2023.
[21] M. Cantone, C. Marrocco, F. Tortorella and A. Bria, "Convolutional Networks and Transformers for Mammography Classification: An Experimental Study," Sensors, vol. 23, p. 1229, 2023.
[22] H. Sun, C. Li, B. Liu, Z. Liu, M. Wang, H. Zheng, D. D. Feng and S. Wang, "AUNet: attention-guided dense-upsampling networks for breast mass segmentation in whole mammograms," Physics in Medicine & Biology, vol. 65, p. 055005, 2020.
[23] S. A. Kamran, K. F. Hossain, A. Tavakkoli, G. Bebis and S. Baker, "SWIN-SFTNet: Spatial Feature Expansion and Aggregation using Swin Transformer for Whole Breast Micro-mass Segmentation," arXiv preprint arXiv:2211.08717, 2022.