<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Challenges in Image Translation for Contrast-Enhanced Mammography using Generative Adversarial Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mohammad Hosseinipour</string-name>
          <email>mohammad.hosseinipour@studenti.unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Bergamin</string-name>
          <email>bergamin@math.unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Harel Kotler</string-name>
          <email>harel.kotler@ioveneto.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gisella Gennaro</string-name>
          <email>gisella.gennaro@ioveneto.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabio Aiolli</string-name>
          <email>aiolli@math.unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Mathematics, University of Padua</institution>
          ,
          <addr-line>Via Trieste, 63, Padua, 35122</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Istituto Oncologico Veneto IOV - IRCCS</institution>
          ,
          <addr-line>Via Gattamelata 64, Padua, 35128</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>Medical imaging is a cornerstone of modern healthcare, facilitating early diagnosis and the development of efficient treatment plans. Breast imaging comprises different modalities, including mammography and MRI, each encompassing unique information. Unfortunately, improving diagnostic performance can be accompanied by an increase in patient-related risks. Specifically, contrast-enhanced mammography (CEM) offers better performance while exposing women to the risk of adverse reactions from the contrast agents it requires. To reduce these risks, deep learning solutions have become one of the promising research frontiers in recent years. In image-to-image translation, a mapping function is learned to transform a given image from a source domain to a target domain. In medical imaging, the most common solutions are based on GANs, such as pix2pix. When applied to CEM, we found that pix2pix encounters specific challenges due to low data quality, insufficient model capacity, and domain-derived requirements. Thus, these models have low performance out-of-the-box. In this paper, we highlight these specific challenges, propose tailored evaluation strategies, and present preliminary results on a novel dataset, showcasing the need for specialized approaches in medical imaging translation.</p>
      </abstract>
      <kwd-group>
        <kwd>Generative</kwd>
        <kwd>medical imaging</kwd>
        <kwd>generative adversarial networks</kwd>
        <kwd>AI in healthcare</kwd>
        <kwd>image translation</kwd>
        <kwd>breast cancer detection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Medical imaging is a key pillar of modern medicine. It provides the foundation for diagnostics, the
process of identifying and characterizing a disease, monitoring, and treatment planning. Medical breast
imaging includes different imaging methods, such as mammography, ultrasound, and MRI [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        Mammography is an X-ray based method that projects the breast into a 2D image. Since the breast
is a 3D object, overlapping tissues may mask underlying anomalies or generate false ones when it is
projected to 2D. This is further emphasized in dense breasts, typical in younger women, where the
elevated amount of breast tissue increases the overlap and reduces diagnostic performance [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        Contrast-enhanced mammography (CEM) is a method developed to overcome this challenge. CEM
uses intravenous iodinated contrast agents and energy subtraction to increase anatomical contrast
and better represent potential malignancies, thus making it particularly strong in detecting masses in
women who have dense breast tissue [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. As demonstrated in Figure 1, a CEM exam results in two main
images used by radiologists to diagnose the breast [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]:
1. Processed low-energy (pLE) image: a mammography-equivalent image that does not show
the enhancement of the contrast media [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>2. Dual-energy subtraction (DES) image: the image showing the contrast enhancement.</p>
      <p>CEUR Workshop Proceedings (ceur-ws.org)</p>
      <p>Figure 1: (a) pLE image (mammography equivalent); (b) DES image (contrast enhanced).</p>
      <p>Despite its advantages, CEM still requires an increased radiation dose compared to standard
mammography and exposes the patient to the risk of adverse reactions related to the contrast agents [6, 7].
Additionally, some women who could benefit from CEM cannot undergo the procedure due to
limitations for contrast agent use, such as renal disease [8]. We argue that reducing or eliminating these
risks without compromising diagnostic performance could lead to broader adoption. Furthermore, this
could allow women who currently cannot benefit from CEM to access its diagnostic capabilities, and
potentially enable its use in general breast cancer screening programs in the future.</p>
      <p>An important technique that holds promise in reducing these risks is image-to-image translation. In
this technique, models learn the relationships between features in an image from the source domain
(e.g., style, structure, or content) and how to associate them with corresponding features in the target
domain [9]. Applying it to CEM holds the potential to create virtual contrast-enhanced images without
the use of iodinated contrast agents.</p>
      <p>Significant work has been done in the field of medical image translation using Generative Adversarial
Networks (GANs). GANs have been applied in recent years in image-to-image translation in medical
imaging, particularly in breast imaging. An example of that ability is MammoGANesis, a GAN-based
framework that can synthesize mammograms [10]. Thanks to the strong performance GANs have demonstrated
on similar tasks across various imaging techniques, we believe they could be used to translate
pLE images to DES images with contrast enhancement [11].</p>
      <p>While GANs excel at generating general images, we observed they can fail to reproduce fine-grained,
location-specific details. Given the importance of these features in medical settings, we incorporated
attention modules to enhance the representation of local features [12]. This approach could offer a
better chance at generating contrast where it is needed and suppressing it where it is not required,
which is critical for clinical use.</p>
      <p>In this paper, we experimentally demonstrate that state-of-the-art attention modules are
outperformed by the older U-Net based GAN architecture when performing CEM image-to-image translation.
Therefore, we propose two novel solutions to overcome this issue, namely an attention-based
improvement on the generator architecture and a tailor-made loss function for CEM images to promote the
reproduction of bright details in the image.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>GANs have been widely used in various medical image translation tasks. For instance, the MedGAN
framework has demonstrated the capability of GANs to generate realistic medical images across multiple
modalities [9]. In virtual contrast generation for breast imaging, Müller-Franzes et al. used GANs to
enhance the effect of contrast media in contrast-enhanced MRI images [13]. This was done in the hope
of reducing the dose of contrast agent used in this imaging method. Since mammograms and CEMs
possess higher resolutions compared to MRIs, we aim to assess the feasibility of applying similar
GAN-based techniques to these higher-resolution imaging modalities. Other works in the literature explicitly
consider the creation of high-resolution images, and they can be considered in future extensions of this
work [14].</p>
      <p>While other approaches, such as CycleGAN and diffusion models, have emerged in the field, they are
not well-suited to our application. CycleGAN [15] is designed for scenarios with unpaired datasets,
i.e., where there is no clear match between input and output, whereas our study utilizes paired images,
making traditional GANs such as pix2pix a more appropriate choice. Furthermore, although denoising
diffusion probabilistic models (e.g., DDPMs) show promise [16], GANs are currently more mature and
provide a more trustworthy technology for our purposes. In particular, an existing work that considered
a low-dose setting for breast MRI gathered some evidence that GAN-generated images are preferred
over DDPM-generated images by radiologists at the lowest levels of contrast agent [13]. The study
indicates that both models are promising, and it concludes that further development is needed. We
argue that the lower performance of DDPMs can be due to a small training set size and higher computing
requirements. In fact, it was observed that GAN-based architectures still work well [17] even with
hundreds of samples. Another important observation concerns the higher inference time required by
diffusion models compared to GANs [18]; this issue can hinder their usage in real-time applications.</p>
      <p>The Attention U-Net model, which introduces attention gates to focus on relevant regions, has already
demonstrated improved performance in medical image segmentation tasks [12]. Our approach builds
on these advancements by integrating channel, multi-scale channel, and spatial attention mechanisms.
This enhancement enables our model to better capture the complex structures inherent in medical
images, resulting in a more accurate generation of DES images.</p>
      <p>Finally, the idea of reweighting the loss function has already been explored in other contexts, such
as in object detection [19]. Our proposal specifically applies to CEM medical images and is designed
to promote the reproduction of bright details. It can be argued that this technique could also mitigate mode collapse
issues: since healthy tissue is predominant in the pixels of the images, the discriminator of the GAN
could be fooled most of the time by reconstructing the mode of the data. Many works in the literature,
such as BicycleGAN [20], specifically address this issue, while the present work does not investigate
mode collapse explicitly.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Background</title>
      <sec id="sec-3-1">
        <p>A general overview of the techniques we considered follows.</p>
        <sec id="sec-3-1-1">
          <title>3.1. Overview of GANs</title>
          <p>For our model, we considered Generative Adversarial Networks (GANs). GANs are a class of machine
learning frameworks designed by Goodfellow et al. in 2014 [21]. GANs consist of two neural networks,
a generator and a discriminator, which are trained alternately through adversarial processes. The
generator’s goal is to create data that is indistinguishable from real data, while the discriminator’s goal
is to correctly identify whether the data is real or generated. The interplay between these two networks
allows GANs to generate high-quality synthetic data.</p>
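<p>This alternating game can be illustrated with a toy numerical example (a minimal NumPy sketch of our own, not related to the models used in this work): a one-parameter generator learns to match a 1-D Gaussian "real" distribution, while a logistic discriminator scores samples, both updated with their analytic gradients.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-np.clip(a, -60, 60)))

# Generator G(z) = w*z + b maps noise to samples;
# Discriminator D(x) = sigmoid(u*x + c) scores "realness".
w, b = 1.0, 0.0
u, c = 1.0, 0.0
lr = 0.02

for step in range(3000):
    real = rng.normal(3.0, 0.5, size=64)   # toy "real" data around mean 3
    z = rng.normal(size=64)
    fake = w * z + b

    # Discriminator step: ascend log D(real) + log(1 - D(fake))
    s_r, s_f = sigmoid(u * real + c), sigmoid(u * fake + c)
    u += lr * np.mean((1 - s_r) * real - s_f * fake)
    c += lr * np.mean((1 - s_r) - s_f)

    # Generator step: ascend log D(fake) (non-saturating loss)
    s_f = sigmoid(u * fake + c)
    w += lr * np.mean((1 - s_f) * u * z)
    b += lr * np.mean((1 - s_f) * u)

# The generator's output mean (= b, for zero-mean noise) drifts toward the real mean.
```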
        </sec>
        <sec id="sec-3-1-2">
          <title>3.2. Image-to-image translation using pix2pix</title>
          <p>The pix2pix model was first introduced by Isola et al. in 2017 [17]. Their work extends the concept
of GANs to the task of paired image-to-image translation. In this task, the objective is to learn a
mapping function G : X → Y, where X is a source domain and Y is a target domain. This model has been
successfully applied to a wide range of domains and used in image colorization, background removal,
and semantic segmentation [17].</p>
          <p>The pix2pix framework employs a conditional GAN [21, 22, 23], where the generator learns to map
an observed image x ∈ ℝ^(H×W×C) and a random noise vector z to an output y ∈ ℝ^(H×W×C), G : {x, z} → y, to
fool the discriminator, while the discriminator evaluates whether the image is real or fake.
3.2.1. PatchGAN
In pix2pix, the discriminator is a PatchGAN [21], which classifies whether each N×N patch in an image
is real or fake. This approach ensures that high-frequency structures are captured in the output images.
3.2.2. U-Net
The U-Net architecture, proposed by Ronneberger et al. [24], is a popular choice for image segmentation
tasks, especially in the healthcare domain. It consists of an encoder-decoder structure with skip
connections between corresponding layers in the encoder and decoder. These skip connections help in
retaining spatial information that is often lost during downsampling in the encoder [24].</p>
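<p>The patch size a PatchGAN judges is determined by the discriminator's receptive field. A small sketch of our own (the layer list assumes the common 70×70 configuration with three stride-2 and two stride-1 4×4 convolutions, which is not spelled out in this paper) computes it by folding the convolution stack backwards:</p>

```python
def receptive_field(layers):
    """Receptive field of a conv stack; layers = [(kernel, stride), ...] from input to output."""
    rf = 1
    for k, s in reversed(layers):
        rf = rf * s + (k - s)  # each layer scales the field by its stride and widens it by (k - s)
    return rf

# Three stride-2 and two stride-1 4x4 convolutions, as in the common 70x70 PatchGAN
patchgan = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]
print(receptive_field(patchgan))  # -> 70
```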
          <p>In pix2pix, the most effective generator proposed is the U-Net generator, with the encoder contracting
the input image to a bottleneck layer and the decoder expanding it back to the original size while
merging features from the encoder layers through skip connections. We show the architecture of U-Net
in Fig. 2(a).
3.2.3. Loss Function
The pix2pix architecture employs a composite loss function that consists of two main components: an
adversarial loss and a reconstruction loss.</p>
          <p>• Adversarial Loss (L_cGAN): The adversarial loss is the core component of GANs, where the
generator tries to fool the discriminator, and the discriminator tries to distinguish between real
and fake images [17]. In the context of pix2pix, the conditional adversarial loss (L_cGAN) is used
to condition the generation process on the input image, ensuring that the generated output is a
plausible transformation of the input. The adversarial loss is defined as:
L_cGAN(G, D) = E_{x,y}[log D(x, y)] + E_{x,z}[log(1 − D(x, G(x, z)))]   (1)
where G is the generator, D is the discriminator, x is the input image, y is the real output image,
and z is the noise vector.
• Reconstruction Loss (L_L1): To reduce the distance between two images, the L1 and L2 distances
were investigated. The L2 distance makes the generator create results that are more blurry
compared to the L1 distance, which produces sharper images [17]. This loss encourages the generator
to produce images close to the real images in pixel space, promoting accurate reconstruction of
the target output. The L1 loss is defined as:
L_L1(G) = E_{x,y,z}[‖y − G(x, z)‖_1]   (2)
The final loss function for the generator is a combination of the adversarial loss (L_cGAN) and the L1
distance loss (L_L1). The total loss is given by:
L_total = L_cGAN(G, D) + λ L_L1(G)   (3)
where λ is the weighting factor that balances the contribution of the L1 term.</p>
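<p>The generator-side objective of Eq. 1-3 can be sketched numerically as follows. This is an illustrative NumPy toy, not the authors' implementation; it uses the non-saturating −log D(fake) form of the generator's adversarial term and λ = 100, the pix2pix default:</p>

```python
import numpy as np

EPS = 1e-8  # avoids log(0)

def adversarial_g_term(d_fake):
    """Generator's adversarial term: encourage D to score generated patches as real."""
    return -np.mean(np.log(d_fake + EPS))

def l1_term(y, y_hat):
    """Pixel-space reconstruction term of Eq. 2."""
    return np.mean(np.abs(y - y_hat))

def generator_loss(d_fake, y, y_hat, lam=100.0):
    """Total generator loss of Eq. 3: adversarial term + lambda * L1 term."""
    return adversarial_g_term(d_fake) + lam * l1_term(y, y_hat)

rng = np.random.default_rng(0)
y, y_hat = rng.random((4, 4)), rng.random((4, 4))  # real and generated images
d_fake = np.full((3, 3), 0.5)                      # PatchGAN scores for the generated image
loss = generator_loss(d_fake, y, y_hat)
```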
        </sec>
        <sec id="sec-3-1-3">
          <title>3.3. Attention U-Net</title>
          <p>Attention U-Net, proposed by Oktay et al. [12], introduces attention gates to the standard U-Net
architecture. These attention gates allow the model to focus on relevant regions of the image, thereby
improving output quality. We show the architecture of Attention U-Net in Fig. 2(b).</p>
          <p>Attention gates are inserted into the skip connections of the U-Net, enabling the network to suppress
irrelevant regions and highlight salient features useful for a specific task. This should enhance the
model’s sensitivity and prediction accuracy without significantly increasing computational overhead.</p>
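<p>As a rough illustration of such a gate (our own NumPy sketch: 1×1 convolutions are written as per-pixel matrix products, weights are random rather than learned, and the gating signal is assumed to be at the same resolution as the skip features, unlike the strided variant in [12]):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def attention_gate(x, g, Wx, Wg, psi):
    """Additive attention gate: alpha = sigmoid(psi . relu(Wx x + Wg g)), applied to x."""
    q = np.maximum(np.einsum('fc,chw->fhw', Wx, x) +
                   np.einsum('fc,chw->fhw', Wg, g), 0.0)   # joint features, ReLU
    alpha = sigmoid(np.einsum('f,fhw->hw', psi, q))        # per-pixel gate in (0, 1)
    return x * alpha[None, :, :]                           # suppress irrelevant skip regions

C, Cg, Fint, H, W = 8, 8, 4, 16, 16
x = rng.normal(size=(C, H, W))       # skip-connection features
g = rng.normal(size=(Cg, H, W))      # gating signal from the coarser decoder level
Wx, Wg, psi = rng.normal(size=(Fint, C)), rng.normal(size=(Fint, Cg)), rng.normal(size=Fint)
out = attention_gate(x, g, Wx, Wg, psi)
```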
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Problem definition</title>
      <p>We present the main challenges we found while attempting to solve this problem, give appropriate
context, and discuss a number of mitigation strategies.</p>
      <p>
        1. Data quality
a) Noise: CEM images, and X-ray based medical images in general, are usually affected by
noise [25]. This is because ensuring safety requires limiting the ionizing radiation exposure.
Denoising techniques, such as bilateral filtering [
        <xref ref-type="bibr" rid="ref6">26</xref>
        ], offer a trade-off between sharpness and
noise removal. While the introduction of noise in input data has been argued not to much affect
the generalization capabilities of deep neural networks, due to emergent self-denoising
capabilities [
        <xref ref-type="bibr" rid="ref7">27</xref>
        ], it is yet to be understood whether applying denoising to medical images
improves the capabilities of generative models.
b) Scarcity: as outlined by [17], it is argued that as few as 400 images are needed to train a
good model. This can be crucial for medical applications, since the cost of acquisition and
legal requirements can severely limit the size of the dataset. It is still unclear whether this
applies to high-resolution images.
c) Imbalance: due to the different settings in which medical data is acquired (e.g., screening
data vs. at-risk patient monitoring), data can be affected by different selection biases. The
most important factor is the ratio between negative and positive cases, which can affect the
sensitivity and specificity of the application.
d) Diversity: different conditions can affect each image, ranging from different breast densities
to the presence of foreign objects such as breast implants or surgical clips.
2. Model capacity
a) Architecture: different deep learning architectures have been proposed to process medical
data effectively, with the U-Net being the most popular one [24]. For generative applications,
it is yet to be understood which kind of deep architecture works best.
b) Objective function: the identification of a differentiable objective that aligns with the
requirements of radiologists is not defined unanimously. The literature shows that some
objectives, different from the classic L1/L2 distances, align better with human perception of quality
[
        <xref ref-type="bibr" rid="ref8">28</xref>
        ].
3. Domain-related challenges
a) Reproducing bright areas from low-energy images: bright spots in the DES image
often highlight the presence of lesions. Making sure those areas are preserved increases the
sensitivity of the instrument.
b) Suppressing dark areas from low-energy images: dark areas in the DES image help the
reader not get confused by irrelevant information. Making sure those areas are suppressed
increases the specificity of the instrument.
c) Reproduction of small bright details: specifically in CEM applications, bright spots in
the subtracted image are associated with the presence of lesions. Their size can range from
many centimeters down to a few millimeters in the case of micro-calcifications [
        <xref ref-type="bibr" rid="ref9">29</xref>
        ]. It has
been shown that small details in images can be hard to reproduce with GANs using a standard
MSE loss [
        <xref ref-type="bibr" rid="ref10">30</xref>
        ]. Thus, these small details are likely to be ignored by generative models if not
taken into account.
d) High resolution: CEM images have high resolution (&gt;2048x2048) and high bit depth (12-13
bits). This can severely impact training times. Current works in medical image-to-image
translation do not address these issues [9].
e) Evaluation: the evaluation of medical images can vary considerably from reader to reader.
      </p>
      <p>Thus, finding a proper quality metric for generated images is challenging.</p>
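<p>Regarding challenge 1(a), bilateral filtering smooths noise while preserving edges, because pixels across an intensity edge receive near-zero range weights. A naive NumPy sketch of the technique (illustrative only; clinical pipelines would use optimized implementations):</p>

```python
import numpy as np

def bilateral_filter(img, radius, sigma_s, sigma_r):
    """Edge-preserving smoothing: weight = spatial Gaussian * intensity (range) Gaussian."""
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial = np.exp(-(xs**2 + ys**2) / (2 * sigma_s**2))      # fixed spatial kernel
    padded = np.pad(img, radius, mode='reflect')
    out = np.empty_like(img, dtype=float)
    H, W = img.shape
    for i in range(H):
        for j in range(W):
            patch = padded[i:i + 2 * radius + 1, j:j + 2 * radius + 1]
            rng_w = np.exp(-(patch - img[i, j])**2 / (2 * sigma_r**2))  # range kernel
            w = spatial * rng_w
            out[i, j] = (w * patch).sum() / w.sum()
    return out

# A sharp step edge survives filtering when the range kernel is narrow
step = np.zeros((8, 8)); step[:, 4:] = 1.0
filtered = bilateral_filter(step, radius=2, sigma_s=2.0, sigma_r=0.05)
```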
      <p>In summary, our present work investigates the following aspects:
1. we show how a state-of-the-art architecture is not effective in this specific task, being
outperformed by its U-Net baseline;
2. we provide a new proposal that mitigates this issue using an attention module, and we explore
different U-Net model capacities;
3. we propose a novel loss function that focuses more on reproducing small, bright details in the
image.</p>
      <p>Note that many other solutions could be investigated, such as resampling minority classes, data
augmentation, and data denoising. In this paper, we chose to address the challenges related to the
model architecture and training, and we left other related problems to future work.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Method</title>
      <p>In this section, we cover the main workings of the proposed techniques.</p>
      <sec id="sec-5-1">
        <title>5.1. Inner Attention Module Network (IAMNet)</title>
        <p>
          We aim to enhance the capability of a model to focus on relevant features by incorporating attention
mechanisms. Our work starts from the U-Net architecture and adapts ideas from the
Attention U-Net. To this end, we combine the Convolutional Block Attention Module (CBAM) [
          <xref ref-type="bibr" rid="ref11">31</xref>
          ],
which contains Channel Attention and Spatial Attention, with a third mechanism named Multi-Scale
Channel Attention.
        </p>
        <p>
          Channel Attention Block The Channel Attention Block is designed to highlight important feature
channels, which correspond to particular types of information within an image such as edges, textures,
or colors [
          <xref ref-type="bibr" rid="ref11">31</xref>
          ]. It works by applying two pooling operations, global average pooling and max pooling,
across the entire spatial domain of each channel. These pooling operations produce a channel-wise
descriptor that summarizes the significance of each channel. These descriptors are then passed through
a small neural network, which outputs a set of weights that are applied to the channels via a sigmoid
activation function. The resulting attention map selectively emphasizes the most relevant channels,
allowing the model to focus on key details. The Channel Attention Block is defined as:
ChannelAttention(F) = F × σ(MLP(AvgPool(F)) + MLP(MaxPool(F)))   (4)
        </p>
        <p>where F ∈ ℝ^(B×C×H×W) is the input feature map, B is the batch size, C is the number of channels, and
H and W are the spatial dimensions of the input.</p>
        <p>
          Spatial Attention Block While the Channel Attention Block prioritizes feature channels, the Spatial
Attention Block focuses on identifying critical spatial locations within the feature map [
          <xref ref-type="bibr" rid="ref11">31</xref>
          ]. This block
operates by compressing the channel information into a single map, which highlights regions that
contain significant information. By combining the maximum and average values across the channels,
the module generates a spatial attention map. A convolutional layer followed by a sigmoid activation
function is applied to this map, which is then multiplied element-wise with the original feature map.
The result is an enhanced representation that emphasizes the most important spatial regions, enabling
the network to concentrate on key areas in the image, such as regions of interest. The Spatial Attention
Block is defined as:
        </p>
        <p>SpatialAttention(F) = F × σ(Conv([AvgPool(F), MaxPool(F)]))   (5)
where, again, F ∈ ℝ^(B×C×H×W) is the input feature map, B is the batch size, C is the number of channels,
and H and W are the spatial dimensions of the input.</p>
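<p>Eq. 5 admits a similar sketch (our toy uses a 1×1 convolution, i.e. a weighted sum of the two pooled maps, where CBAM uses a 7×7 convolution):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def spatial_attention(F, w_avg, w_max, bias=0.0):
    """Eq. 5: a single spatial map built from channel-wise average- and max-pooling."""
    avg = F.mean(axis=1, keepdims=True)             # (B, 1, H, W)
    mx = F.max(axis=1, keepdims=True)               # (B, 1, H, W)
    att = sigmoid(w_avg * avg + w_max * mx + bias)  # 1x1 conv over the 2-channel stack
    return F * att                                  # emphasize salient locations

B, C, H, W = 2, 8, 16, 16
F = rng.normal(size=(B, C, H, W))
out = spatial_attention(F, w_avg=0.7, w_max=0.3)
```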
        <sec id="sec-5-1-1">
          <title>Multi-Scale Channel Attention Block</title>
          <p>
            In addition to the Channel and Spatial Attention blocks, we incorporated the Multi-Scale Channel Attention Block to enhance feature representation by processing
the input feature map at multiple scales. This approach is akin to viewing an object through magnifying
glasses of different strengths, allowing the network to capture both coarse and fine details simultaneously.
In this block, average pooling is applied to the feature maps, reducing spatial dimensions and enabling
the model to focus on important features at each scale. Subsequently, Channel Attention is employed
to highlight significant feature channels, ensuring that the most relevant information is prioritized.
After applying attention mechanisms, the feature maps are upsampled back to the original dimensions.
This process generates multiple attention maps, which are then averaged to create a comprehensive
attention map that enhances relevant features across the entire image [
            <xref ref-type="bibr" rid="ref12 ref13 ref14">32, 33, 34</xref>
            ].

(1/S) ∑_{s=1}^{S} σ(ChannelAttention(AvgPool_s(F)))↑
(6)
Where F ∈ ℝ^(B×C×H×W)
          </p>
          <p>is the input feature map, with B as the batch size, C the number of channels, and
H and W representing height and width, respectively. The term S denotes the number of different scales
applied. AvgPool_s(F) represents the average pooling operation at the s-th scale, which reduces the spatial
dimensions of the feature map. The output is then passed through the Channel Attention mechanism,
and σ is the sigmoid activation function that produces the attention map. The symbol (⋅)↑ denotes
upsampling the attention-modulated feature map back to the original spatial dimensions using bilinear
interpolation. Finally, the attention maps from all scales are averaged and applied element-wise to the
original input F, allowing the model to emphasize relevant features across multiple scales. By integrating
different combinations of attention blocks, we explored a higher-performance generator that employs
an encoder-decoder structure with a focus on integrating attention mechanisms at its bottleneck. We
represent its architecture in Fig. 2(c). This design allows the model to effectively highlight key feature
channels and/or important spatial regions in the feature maps via different combinations of attention
blocks. The full inner Attention Module is comprised of three components, as shown in Fig. 3(b).
IAMNet thus enhances the model’s ability to pick out relevant information, ultimately boosting its
performance in tasks like segmentation and detection.</p>
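<p>Eq. 6 can be sketched as follows (a simplified NumPy toy of our own: block-mean pooling stands in for AvgPool_s, a broadcast replaces bilinear upsampling since the channel weights are spatially constant here, the MLP weights are random, and H and W are assumed divisible by each scale):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def channel_att_weights(F, W1, W2):
    """Per-channel weights as in Eq. 4, returned as a (B, C) array."""
    avg, mx = F.mean(axis=(2, 3)), F.max(axis=(2, 3))
    mlp = lambda v: np.maximum(v @ W1, 0.0) @ W2
    return sigmoid(mlp(avg) + mlp(mx))

def multiscale_channel_attention(F, W1, W2, scales=(1, 2, 4)):
    """Eq. 6: average channel-attention maps over pooled scales, then apply to F."""
    B, C, H, W = F.shape
    maps = []
    for s in scales:
        pooled = F.reshape(B, C, H // s, s, W // s, s).mean(axis=(3, 5))  # AvgPool_s
        att = channel_att_weights(pooled, W1, W2)                         # (B, C)
        maps.append(np.broadcast_to(att[:, :, None, None], F.shape))      # "upsample"
    return F * np.mean(maps, axis=0)

B, C, H, W, r = 2, 8, 16, 16, 4
F = rng.normal(size=(B, C, H, W))
W1, W2 = rng.normal(size=(C, C // r)), rng.normal(size=(C // r, C))
out = multiscale_channel_attention(F, W1, W2)
```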
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Exponential dampening loss function</title>
        <p>As noted in Section 4, the quality of generated images is dependent on the presence of small, bright
details. Thus, giving equal weight to the fidelity of dark and bright areas is undesirable. In particular,
using the L1 distance between the source and the target image is not a sensible choice due to its
symmetry, represented in Fig. 4 (left). Therefore, we provide a novel loss function, defined in Eq. 7:
L_L1(G; γ) = E_{x,y,z}[w ⋅ ‖y − G(x, z)‖_1],   w = e^(−γ(1−y))   (7)</p>
        <p>The loss considers the normalized intensity y of the target pixel (i.e., between 0 and 1). If the target pixel
is not bright (closer to 0), then the value of the L1 loss term is scaled down by the exponential factor.
We show in Fig. 4 (middle) a simplified example: errors on dark target pixels are penalized less,
while errors on bright target pixels keep their full weight.</p>
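<p>The reweighting of Eq. 7 is a one-liner; the sketch below (ours, on synthetic pixel arrays) checks the two properties above: γ = 0 recovers the plain L1 loss, and an error on a bright target outweighs the same error on a dark target.</p>

```python
import numpy as np

def damped_l1(y, y_hat, gamma):
    """Eq. 7: L1 loss with per-pixel weight w = exp(-gamma * (1 - y)), y normalized to [0, 1]."""
    w = np.exp(-gamma * (1.0 - y))   # ~1 on bright targets, exponentially small on dark ones
    return np.mean(w * np.abs(y - y_hat))

y = np.array([1.0, 0.0])        # one bright, one dark target pixel
y_hat = np.array([0.5, 0.5])    # same absolute error on both
plain = damped_l1(y, y_hat, gamma=0.0)            # identical to the ordinary L1 mean
bright_err = damped_l1(y[:1], y_hat[:1], 2.0)     # error on the bright pixel, full weight
dark_err = damped_l1(y[1:], y_hat[1:], 2.0)       # same error on the dark pixel, dampened
```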
        <p>The γ hyperparameter can be fixed, selected through tuning, or annealed during training.
Note that, when γ = 0, the function matches the original L1 definition. A higher γ value moves the
objective further from the original L1 loss (Fig. 4, right).</p>
        <p>Figure 2: generator architectures: (a) U-Net, (b) Attention U-Net, (c) Inner Attention Module Net.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Evaluation</title>
      <p>In this section, we evaluate our proposal on a benchmark dataset. First, we describe the characteristics
of the dataset. Then, we review the chosen quality metrics. Finally, we report our experimental setting
and our results.</p>
      <sec id="sec-6-1">
        <title>6.1. Dataset</title>
        <p>
          We obtained the dataset from the Istituto Oncologico Veneto (IOV – IRCCS), which includes images
from 550 patients, resulting in approximately 2000 image pairs of low-energy images and DES. The
low-energy images were initially acquired as raw images, which underwent a processing step for standard
contrast adjustment and noise reduction for better object visibility [
          <xref ref-type="bibr" rid="ref15">35</xref>
          ]. This results in a processed
low-energy image (pLE), which is equivalent to standard mammography [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. The dataset, provided in
DICOM file format, cannot be made publicly available due to legal requirements. The resolution of the
images is 2850 × 2394 pixels. The dataset went through comprehensive reviews to remove outliers,
artifacts, and abnormalities. Subsequently, the dataset was divided into training and test subgroups in a
95:5 ratio.
        </p>
        <p>Benchmark: We created a benchmark of 11 patients with a mass in at least one of their DES images.
To better understand the performance, we compared them based on L1 distance, L2 distance, ΔCNR, and
Peak Signal-to-Noise Ratio (PSNR) metrics in three different settings. The first setting is the segmented
breasts without visible masses, the second setting is the segmented breasts with visible masses, and the
third setting is the segmented masses only.</p>
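<p>The benchmark's pixel-distance metrics can be computed as below (a straightforward sketch; MAX denotes the maximum possible pixel value, e.g. 2^12 − 1 for 12-bit images):</p>

```python
import numpy as np

def l1_distance(y, y_hat):
    """Mean absolute pixel difference."""
    return np.mean(np.abs(y - y_hat))

def l2_distance(y, y_hat):
    """Mean squared pixel difference (MSE)."""
    return np.mean((y - y_hat) ** 2)

def psnr(y, y_hat, max_val):
    """Peak Signal-to-Noise Ratio in dB; higher is better."""
    mse = l2_distance(y, y_hat)
    return 10.0 * np.log10(max_val**2 / mse)

y = np.zeros((4, 4))
y_hat = np.full((4, 4), 0.5)
# Constant error of 0.5 on [0, 1] images: MSE = 0.25, PSNR = 10*log10(4) dB
```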
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Contrast vs. Pixel Value Distance</title>
        <p>We argue that a good contrast-enhanced image is one that effectively shows the contrast between
the mass and the surrounding tissues, rather than just having a lower L1 or L2 distance. Although a
generated image may have an average pixel value closer to that of the target image, it may fail to capture
the perceptible contrast between the mass and the surrounding tissue, reducing its clinical relevance.
Therefore, we used the ΔCNR metric, which can help us quantitatively compare the models based on
their ability to provide clear contrast for mass visualization.</p>
        <p>Contrast-to-Noise Ratio (CNR): CNR is a quantitative measure commonly used in medical imaging
to evaluate the contrast of a region of interest (ROI), such as a mass, against its surrounding background.
It is defined as the difference in the average pixel intensity between the ROI and its surrounding
background, normalized by the standard deviation of the background. The general equation for CNR
can be expressed as:</p>
        <p>CNR = (μ_ROI − μ_background) / σ_background   (8)</p>
        <p>where:
• μ_ROI is the mean pixel intensity of the region of interest,
• μ_background is the mean pixel intensity of the surrounding background,
• σ_background is the standard deviation of the pixel intensities in the background.</p>
        <p>In our study, we adapt this general definition of CNR to assess the contrast specifically in our three
diferent benchmark settings. To do this, we employ two diferent approaches to compute CNR:
(8)
(9)
• Mass CNR: For the ”mass only” setting, two readers have segmented all the masses and their
surroundings in the real DES images separately, and applied the segmentations to both real and
generated DES images. For each DES image, we compute the CNR as the diference between
the mean pixel intensity of all segmented mass regions and the mean pixel intensity of all the
surrounding regions of the masses, normalized by the standard deviation of the surroundings.
This can be expressed mathematically as:</p>
        <p>CNR =
 mass_regions −  surrounding_regions</p>
        <p>surrounding_regions
masses,
regions of all the segmented masses.
–  mass_regions is the mean pixel intensity of all segmented mass regions,
–  surrounding_regions is the mean pixel intensity of the surrounding regions of all the segmented
–  surrounding_regions is the standard deviation of the pixel intensities in the surrounding
• Breast CNR: For both real and generated DES images, we segment the breast and apply square
patches with patch size = 64 × 64 and stride =
and then averaged across all patches. This method is used for the ”segmented breast with mass”
and ”segmented breast without mass” settings. Mathematically, for each patch on the segmented
4
patch size . The CNR is computed for each patch
breast, the CNR is computed as follows:
–  , is the mean pixel intensity of the  -th target patch (the center patch),
–  , is the mean pixel intensity of the surrounding 8 patches for the  -th target patch,
–  , is the standard deviation of the pixel intensities of the surrounding 8 patches for the  -th
target patch.</p>
        <p>The overall CNR for the entire image is then averaged across all  patches in the segmented
breast region, which can be written as:</p>
        <p>CNR =
 , −  ,</p>
        <p>,
CNR = 1
∑ CNR

where  is the total number of patches in the segmented breast region.
(10)
(11)
(12)
ΔCNR Metric: The ΔCNR metric is defined as the diference between the CNR values of the generated
(fake) DES images and the ground truth (real) DES images in each of our three analytical benchmarking
settings. This can be expressed mathematically as:</p>
        <p>ΔCNR = CNR_fake − CNR_real (12)
where:
• CNR_real is the Contrast-to-Noise Ratio of the real DES images,
• CNR_fake is the Contrast-to-Noise Ratio of the generated DES images.</p>
        <p>This metric allows for a quantitative comparison of mass visibility between real and generated
images at the whole-breast and mass-only levels.</p>
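        <p>As an illustrative sketch (not the authors' implementation), the patch-based Breast CNR of Eqs. (10)–(11) and the ΔCNR of Eq. (12) could be computed as follows; the function names and the boundary handling near the breast border are assumptions:</p>

```python
import numpy as np

def breast_cnr(image, patch_size=64):
    """Patch-based CNR (Eqs. 10-11): each target patch is contrasted
    against its 8 surrounding patches, then per-patch CNRs are averaged."""
    stride = patch_size // 4  # stride = patch size / 4, as in the paper
    h, w = image.shape
    cnrs = []
    # keep a one-patch border so the 3x3 neighbourhood stays inside the image
    for y in range(patch_size, h - 2 * patch_size + 1, stride):
        for x in range(patch_size, w - 2 * patch_size + 1, stride):
            target = image[y:y + patch_size, x:x + patch_size]
            neigh = image[y - patch_size:y + 2 * patch_size,
                          x - patch_size:x + 2 * patch_size]
            mask = np.ones(neigh.shape, dtype=bool)
            mask[patch_size:2 * patch_size, patch_size:2 * patch_size] = False
            surround = neigh[mask]  # pixels of the 8 surrounding patches
            sigma = surround.std()
            if sigma > 0:
                cnrs.append((target.mean() - surround.mean()) / sigma)
    return float(np.mean(cnrs)) if cnrs else 0.0

def delta_cnr(fake, real, patch_size=64):
    """ΔCNR = CNR(fake) − CNR(real), Eq. (12)."""
    return breast_cnr(fake, patch_size) - breast_cnr(real, patch_size)
```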
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Experimental setting</title>
        <p>We trained our models on a single NVIDIA V100 GPU, with every model taking 12-24 hours to complete
its training, depending on the architecture size. We report in Table 1 the hyperparameters used for the
trained models. To train the models, we used a fixed learning rate for the first 100 epochs, then we linearly
decayed it over 200 additional epochs. We employed Adam as the optimizer. For models trained using
the exponential dampening function, we initialized the value to γ(t = 0) = γ₀, then we linearly decayed it
until γ(t = T) = 0, using γ(t) = γ₀(1 − t/T), where t is the current epoch number and T is the total number
of epochs.</p>
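        <p>The linear annealing of the dampening value described above can be sketched as follows; the function name and the default γ₀ are illustrative, not values from the paper:</p>

```python
def gamma_schedule(epoch, total_epochs, gamma0=3.0):
    """Linear annealing of the dampening value:
    gamma(0) = gamma0, gamma(T) = 0, via gamma(t) = gamma0 * (1 - t/T)."""
    return gamma0 * (1.0 - epoch / total_epochs)
```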
        <p>Table 1. Hyperparameters used for the trained models. Among others, it reports: GAN objective = Least-square, Batch size = 10, Epochs (fixed LR) = 100, #Conv. (Attn U-Net) = 9, together with the Resolution (train), Learning rate, Generator, Epochs (LR decay), and #Conv. (IAMNet).</p>
      </sec>
      <sec id="sec-6-4">
        <title>6.4. Attention Module analysis</title>
        <p>The attention blocks work as coefficients on the feature maps in the bottleneck of IAMNet, applied
sequentially, without changing the data dimensionality. Therefore, we can test each of them individually
to see the potential of each attention block alone and which one is the most effective. As shown in Table 2,
the Channel Attention Block performs best on the segmented breast; however, when it comes to the contrast of
the mass, the Multi-Scale Channel Attention outperforms the Channel Attention. Moreover, Spatial
Attention showed worse results in both the mass- and breast-segmented settings for all metrics. This is
potentially because, at the bottleneck, the spatial dimension of the feature maps is already
shrunk and cannot provide valuable information.</p>
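        <p>To illustrate how such blocks act as coefficients on bottleneck feature maps without changing dimensionality, the following is a generic sketch of channel and spatial attention (not the paper's exact blocks; the sigmoid-of-pooled-activations form is an assumption):</p>

```python
import numpy as np

def channel_attention(fmap):
    """Per-channel coefficient: sigmoid of the global average of each
    channel rescales that channel (generic sketch, not IAMNet's block)."""
    weights = 1.0 / (1.0 + np.exp(-fmap.mean(axis=(1, 2))))  # (C,)
    return fmap * weights[:, None, None]

def spatial_attention(fmap):
    """Per-location coefficient: one sigmoid map over H x W, shared by
    all channels (generic sketch)."""
    pooled = fmap.mean(axis=0)                # (H, W) channel average
    weights = 1.0 / (1.0 + np.exp(-pooled))   # sigmoid
    return fmap * weights[None, :, :]

# Applied sequentially at the bottleneck, the (C, H, W) shape is unchanged,
# so each block can also be tested in isolation.
x = np.random.randn(256, 16, 16)
y = spatial_attention(channel_attention(x))
```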
      </sec>
      <sec id="sec-6-5">
        <title>6.5. Attention Module comparison</title>
        <p>In Table 4, we investigate the use of attention mechanisms to guide the model toward more relevant
features. Contrary to the performance improvement expected from the Attention U-Net, in this specific
medical imaging study we observe even lower performance than the U-Net baseline. By comparing IAMNet
with U-Net and with U-Net with exponential dampening, we observed that IAMNet performed worse in both the
"with mass" and "without mass" breast-segmented scenarios. Our initial hypothesis was that IAMNet might
perform better at reproducing the specific regions containing the masses. To test this, we segmented the
images and computed the performance metrics exclusively for the mass regions. In this targeted analysis,
IAMNet showed improved performance, suggesting that it is better suited to focusing on specific lesions.
This result indicates a trade-off: while IAMNet's overall reconstruction performance across the entire
image was inferior, it demonstrated enhanced performance in the lesion areas.</p>
      </sec>
      <sec id="sec-6-6">
        <title>6.6. Exponential dampening loss</title>
        <p>We report in Table 3 our results, computed using a U-Net baseline architecture as the generator,
for γ ∈ {0.0, 0.5, 1.0, 2.0, 3.0}. We mainly observe similar and consistent results across the different
selected γ values. We found a small but consistent improvement with higher γ values, with the best results
obtained with γ = 2 and γ = 3. If we take into account only the segmented mass area, we find higher errors,
meaning that it is generally harder for the model to reconstruct the correct intensity. Nonetheless, this
is where we observe the largest improvement, which shows that higher γ values improve the reconstruction
of bright spots in the image. Interestingly, we found that annealing the γ value from a high to a low value
performed best. This can be seen as a form of curriculum learning, where the model is first trained to
solve a specific task and then moves to other tasks to improve its overall performance.</p>
        <p>Finally, in Figure 5, we analyze the frequency of the pixel intensities of the generated images and
compare them to the original DES images. We find results consistent with the existing literature [17],
as our models are consistently able to reproduce most of the target data distribution.</p>
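        <p>Such a pixel-intensity comparison could be computed as in the sketch below; the function name, bin count, and shared-range normalization are assumptions for illustration:</p>

```python
import numpy as np

def intensity_histograms(real, fake, bins=64):
    """Normalized pixel-intensity histograms of a real and a generated DES
    image over a shared range, for distribution comparison (sketch)."""
    lo = min(real.min(), fake.min())
    hi = max(real.max(), fake.max())
    h_real, edges = np.histogram(real, bins=bins, range=(lo, hi), density=True)
    h_fake, _ = np.histogram(fake, bins=bins, range=(lo, hi), density=True)
    return h_real, h_fake, edges
```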
      </sec>
      <sec id="sec-6-7">
        <title>6.7. Qualitative comparison</title>
        <p>In Fig. 6, we show a comparison of input images (pLE, mammogram-equivalent images without contrast
enhancement), ground-truth images (DES, contrast-enhanced images), and generated images. The
ground-truth DES images (b) demonstrate multiple bright spots, which in CEM images can represent
the breast border, normal tissue, or masses. The presence, position, and intensity of these bright areas
are crucial for radiologists, as they guide the interpretation of the images and aid in the detection of
potential abnormalities and the identification of potentially cancerous masses. We can observe that
the U-Net usually suppresses many bright details. Using γ = 2, bright spots are better preserved, but still
do not closely resemble the ground truth. The Attention U-Net presents some white
artifacts and lacks fine detail definition. Finally, our IAMNet performs best, effectively
highlighting the bright details in the DES image and closely matching both the brightness and location of the
bright areas found in the ground truth, potentially enough to raise suspicion of abnormalities.</p>
        <p>Figure 6: (a) pLE, (b) DES, (c) U-Net, (d) U-Net (γ = 2), (e) Attn. U-Net, (f) IAMNet.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions</title>
      <p>The results in this paper suggest that investigating novel architectures and losses is an effective way to
address many of the challenges in image-to-image translation for CEM medical images. We provide some
preliminary experimental data showing that our approach has potential, but it requires further improvement
to reach clinical utility. Despite achieving high quantitative metrics, the generated images miss or
inaccurately represent anatomical details and occasionally fail to clearly delineate critical features
such as tumor boundaries or vascular structures, limiting their current diagnostic value.</p>
      <p>The first step to address these shortcomings is to explore whether incorporating skip connections
into the IAMNet architecture could enhance the preservation of clinically relevant details. Interestingly,
despite lacking skip connections, IAMNet has shown the ability to outperform other models that rely
on them. Furthermore, integrating the proposed loss function into new architectures should be explored
to further refine image quality and maintain diagnostic integrity.</p>
      <p>In future work, we want to address more of the challenges we stated, considering better ways to
evaluate our results, both qualitatively and quantitatively. Moreover, we want to consider in a
systematic way whether mode collapse is a measurable issue for generative models applied to medical
images.</p>
      <p>[6] G. Gennaro, A. Cozzi, S. Schiaffino, F. Sardanelli, F. Caumo, Radiation dose of contrast-enhanced
mammography: A two-center prospective comparison, Cancers 14 (2022) 1774. doi:10.3390/cancers14071774.
[7] W. Bottinor, P. Polkampally, I. Jovin, Adverse reactions to iodinated contrast media, International
Journal of Angiology 22 (2013) 149–154. doi:10.1055/s-0033-1348885.
[8] American College of Radiology, ACR manual on contrast media, version 10.3, 2017. URL: https://www.acr.org/-/media/ACR/Files/Clinical-Resources/Contrast_Media.pdf.
[9] K. Armanious, C. Jiang, M. Fischer, T. Küstner, T. Hepp, K. Nikolaou, S. Gatidis, B. Yang, MedGAN:
Medical image translation using GANs, Computerized Medical Imaging and Graphics 79 (2020)
101684. doi:10.1016/j.compmedimag.2019.101684.
[10] C. Zakka, G. Saheb, E. Najem, G. Berjawi, MammoGANesis: Controlled generation of high-resolution
mammograms for radiology education, arXiv (2020). doi:10.48550/ARXIV.2010.05177.
[11] M. Gong, S. Chen, Q. Chen, Y. Zeng, Y. Zhang, Generative adversarial networks in
medical image processing, Current Pharmaceutical Design 27 (2021) 1856–1868. doi:10.2174/1381612826666201125110710.
[12] O. Oktay, J. Schlemper, L. L. Folgoc, M. Lee, M. Heinrich, K. Misawa, K. Mori, S. McDonagh, N. Y.
Hammerla, B. Kainz, B. Glocker, D. Rueckert, Attention U-Net: Learning where to look for the
pancreas, 2018. URL: https://arxiv.org/abs/1804.03999. arXiv:1804.03999.
[13] G. Müller-Franzes, L. Huck, M. Bode, et al., Diffusion probabilistic versus generative adversarial
models to reduce contrast agent dose in breast MRI, European Radiology Experimental 8 (2024).
doi:10.1186/s41747-024-00451-3.
[14] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, B. Catanzaro, High-resolution image synthesis and
semantic manipulation with conditional GANs, 2018 IEEE/CVF Conference on Computer Vision and
Pattern Recognition (2017) 8798–8807. URL: https://api.semanticscholar.org/CorpusID:41805341.
[15] J.-Y. Zhu, T. Park, P. Isola, A. A. Efros, Unpaired image-to-image translation using cycle-consistent
adversarial networks, 2020. URL: https://arxiv.org/abs/1703.10593. arXiv:1703.10593.
[16] J. Ho, A. Jain, P. Abbeel, Denoising diffusion probabilistic models, ArXiv abs/2006.11239 (2020).
URL: https://api.semanticscholar.org/CorpusID:219955663.
[17] P. Isola, J.-Y. Zhu, T. Zhou, A. A. Efros, Image-to-image translation with conditional adversarial
networks, 2018. URL: https://arxiv.org/abs/1611.07004. arXiv:1611.07004.
[18] X. Liu, C. Gong, Q. Liu, Flow straight and fast: Learning to generate and transfer data with rectified
flow, 2022. URL: https://arxiv.org/abs/2209.03003. arXiv:2209.03003.
[19] T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense object detection, 2018. URL:
https://arxiv.org/abs/1708.02002. arXiv:1708.02002.
[20] J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, E. Shechtman, Toward multimodal
image-to-image translation, in: Neural Information Processing Systems, 2017. URL: https://api.semanticscholar.org/CorpusID:19046372.
[21] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio,
Generative adversarial networks, 2014. URL: https://arxiv.org/abs/1406.2661. arXiv:1406.2661.
[22] M. Mirza, S. Osindero, Conditional generative adversarial nets, 2014. URL: https://arxiv.org/abs/1411.1784. arXiv:1411.1784.
[23] J. Gauthier, Conditional generative adversarial nets for convolutional face generation, 2015. URL:
https://api.semanticscholar.org/CorpusID:3559987.
[24] O. Ronneberger, P. Fischer, T. Brox, U-Net: Convolutional networks for biomedical image
segmentation, in: N. Navab, J. Hornegger, W. M. Wells, A. F. Frangi (Eds.), Medical Image Computing and
Computer-Assisted Intervention – MICCAI 2015, Springer International Publishing, Cham, 2015,
pp. 234–241.
[25] S. V M, S. George, A review on medical image denoising algorithms, Biomedical Signal Processing
and Control 61 (2020) 102036. doi:10.1016/j.bspc.2020.102036.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>We would like to thank the Istituto Oncologico Veneto (IOV – IRCCS) for providing the dataset used in
this research. https://www.ioveneto.it/en/</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Guraya</surname>
          </string-name>
          ,
          <article-title>Breast cancer screening programs: Review of merits, demerits, and recent recommendations practiced across the world</article-title>
          ,
          <source>Journal of Microscopy and Ultrastructure</source>
          <volume>5</volume>
          (
          <year>2017</year>
          )
          <fpage>59</fpage>
          . doi:10.1016/j.jmau.2016.10.002.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Al Mousa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. A.</given-names>
            <surname>Ryan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Mello-Thoms</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. C.</given-names>
            <surname>Brennan</surname>
          </string-name>
          ,
          <article-title>What efect does mammographic breast density have on lesion detection in digital mammography?</article-title>
          ,
          <source>Clinical Radiology</source>
          <volume>69</volume>
          (
          <year>2014</year>
          )
          <fpage>333</fpage>
          -
          <lpage>341</lpage>
          . doi:10.1016/j.crad.2013.11.014.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Fico</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. D.</given-names>
            <surname>Grezia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Cuccurullo</surname>
          </string-name>
          , et al.,
          <article-title>Breast imaging physics in mammography (part ii)</article-title>
          ,
          <source>Diagnostics</source>
          <volume>13</volume>
          (
          <year>2023</year>
          )
          <fpage>3582</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>I. P. L.</given-names>
            <surname>Houben</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Van De Voorde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. R. L. P. N.</given-names>
            <surname>Jeukens</surname>
          </string-name>
          , et al.,
          <article-title>Contrast-enhanced spectral mammography as work-up tool in patients recalled from breast cancer screening has low risks and might hold clinical benefits</article-title>
          ,
          <source>European Journal of Radiology</source>
          <volume>94</volume>
          (
          <year>2017</year>
          )
          <fpage>31</fpage>
          -
          <lpage>37</lpage>
          . doi:10.1016/j.ejrad.2017.07.00.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M. B. I.</given-names>
            <surname>Lobbes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Smidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Houwers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. C.</given-names>
            <surname>Tjan-Heijnen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Wildberger</surname>
          </string-name>
          ,
          <article-title>Contrast-enhanced spectral mammography: techniques, current results, and potential indications</article-title>
          ,
          <source>Clinical Radiology</source>
          <volume>68</volume>
          (
          <year>2013</year>
          )
          <fpage>935</fpage>
          -
          <lpage>944</lpage>
          . doi:10.1016/j.crad.2013.04.009
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>C.</given-names>
            <surname>Tomasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Manduchi</surname>
          </string-name>
          ,
          <article-title>Bilateral filtering for gray and color images</article-title>
          ,
          <source>Sixth International Conference on Computer Vision</source>
          (IEEE Cat.
          <source>No.98CH36271)</source>
          (
          <year>1998</year>
          )
          <fpage>839</fpage>
          -
          <lpage>846</lpage>
          . URL: https://api.semanticscholar.org/CorpusID:14308539.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>G.</given-names>
            <surname>Charpiat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Girard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Felardos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tarabalka</surname>
          </string-name>
          ,
          <article-title>Input similarity from the neural network perspective</article-title>
          ,
          <source>in: Neural Information Processing Systems</source>
          ,
          <year>2019</year>
          . URL: https://api.semanticscholar.org/CorpusID:202779680.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>J.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Alahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fei-Fei</surname>
          </string-name>
          ,
          <article-title>Perceptual losses for real-time style transfer and super-resolution</article-title>
          ,
          <year>2016</year>
          . URL: https://arxiv.org/abs/1603.08155. arXiv:1603.08155.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>C.</given-names>
            <surname>Depretto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Borelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Liguori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Presti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vingiani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Cartia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ferranti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. P.</given-names>
            <surname>Scaperrotta</surname>
          </string-name>
          ,
          <article-title>Contrast-enhanced mammography in the evaluation of breast calcifications: preliminary experience</article-title>
          ,
          <source>Tumori Journal</source>
          <volume>106</volume>
          (
          <year>2020</year>
          )
          <fpage>491</fpage>
          -
          <lpage>496</lpage>
          . URL: https://api.semanticscholar.org/CorpusID:219553317.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>W.</given-names>
            <surname>Lotter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kreiman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cox</surname>
          </string-name>
          ,
          <article-title>Unsupervised learning of visual structure using predictive generative networks</article-title>
          ,
          <year>2016</year>
          . URL: https://arxiv.org/abs/1511.06380. arXiv:1511.06380.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>S.</given-names>
            <surname>Woo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Park</surname>
          </string-name>
          , J.-
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. S.</given-names>
            <surname>Kweon</surname>
          </string-name>
          ,
          <article-title>CBAM: Convolutional block attention module</article-title>
          ,
          <year>2018</year>
          . URL: https://arxiv.org/abs/1807.06521. arXiv:1807.06521.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>An</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Pyramid attention network for semantic segmentation</article-title>
          ,
          <year>2018</year>
          . URL: https://arxiv.org/abs/1805.10180. arXiv:1805.10180.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <article-title>Non-local neural networks</article-title>
          ,
          <year>2018</year>
          . URL: https://arxiv.org/abs/1711.07971. arXiv:1711.07971.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>P.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Msanet: Multi-scale attention networks for image classification</article-title>
          ,
          <source>Multimedia Tools and Applications</source>
          <volume>81</volume>
          (
          <year>2022</year>
          )
          <fpage>34325</fpage>
          -
          <lpage>34344</lpage>
          . URL: https://api.semanticscholar.org/CorpusID:248782198.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>J.</given-names>
            <surname>Phillips</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Steinkeler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Talati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Brook</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dialani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fishman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Slanetz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. S.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <article-title>Workflow considerations for incorporation of contrast-enhanced spectral mammography into a breast imaging practice</article-title>
          ,
          <source>Journal of the American College of Radiology</source>
          <volume>15</volume>
          (
          <year>2018</year>
          )
          <fpage>881</fpage>
          -
          <lpage>885</lpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S1546144018302059. doi:10.1016/j.jacr.2018.02.012.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>