Coupled Feedback Attention Networks

Rong Wang 1*, Chunjiang Duanmu 2
1 College of Mathematics and Computer Science, Zhejiang Normal University, Jinhua, Zhejiang, China
2 College of Physics and Electronic Information Engineering, Zhejiang Normal University, Jinhua, Zhejiang, China

Abstract
In their daily lives, people frequently need images with a high dynamic range and a high resolution. Due to the limitations of imaging equipment, high dynamic range images are produced by multi-exposure fusion (MEF) of low dynamic range images, while high resolution images are frequently obtained by super-resolution (SR) of low resolution images. MEF and SR are usually studied separately. This paper examines existing approaches and proposes a coupled feedback attention network, and a method based on it, to address the issue that current models cannot achieve high dynamic range and high resolution simultaneously.

Keywords
channel attention mechanism; coupled feedback mechanism

1 Introduction
High dynamic range (HDR) images contain a broader dynamic range and richer texture features than typical low dynamic range (LDR) and low resolution (LR) images, and high resolution (HR) images can enhance object detection accuracy. Technical methods to obtain HDR images and HR images include, respectively, multi-exposure image fusion (MEF) and single image super-resolution (SISR). By fusing two LDR images of extreme exposure, an extreme-exposure image fusion method creates an HDR image. Ma et al.[8] provided a fast approach for fusing multiple exposure images that refines the initial weights using a guided filter. Later, Xu et al.[7] proposed a unified unsupervised fusion method that overcomes the fusion barrier of most image types by constraining the similarity between the fused image and the original images. With the continuous development of deep neural networks, many CNN-based methods have also been proposed in the field of SISR.
RCAN[4] introduces an attention mechanism to further improve reconstruction quality. SRFBN[2] introduces a feedback structure that iteratively refines shallow features into deeper features. The above MEF and SISR methods solve the LDR and LR problems separately, but in real life people often need to view HDR, HR images on cell phones or TVs, so a joint MEF and SR method is necessary. This paper proposes an image exposure fusion and super-resolution method based on a coupled feedback attention network, which effectively suppresses the accumulation of redundant information across the cyclic iterations and improves parameter sharing as well as exposure-feature propagation.

AIoTC2022@International Conference on Artificial Intelligence, Internet of Things and Cloud Computing Technology. EMAIL: 2101440741@qq.com (Rong Wang), duanmu@zjnu.cn (Chunjiang Duanmu). © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

2 Coupled Feedback Attention Network
To suppress the propagation of redundant features and enhance the propagation of useful features in the coupled feedback network, this paper combines a coupled attention mechanism with a feedback mechanism and proposes an image exposure fusion and image super-resolution method based on the coupled feedback attention network.

2.1 Basic network structure
The structure of the coupled feedback network is shown in Fig. 1. The shallow features F_o and F_u (of the over- and under-exposed inputs) go through T rounds of iteration by the coupled feedback attention module in the upper and lower sub-networks, respectively. In each iteration, the feedback features of the other sub-network are combined with the shallow features of the current sub-network as the input of the next iteration, so as to progressively refine the fused features.
The coupled feedback attention layer contains multiple coupled feedback blocks (CFBs) and an attention module. The extraction of the shallow features F_o and F_u of the two LR inputs can be expressed as

F_o = f_FEB(I_o^lr),  F_u = f_FEB(I_u^lr)

where f_FEB contains two convolutional layers, Conv(3, 4m) and Conv(1, m), which extract and compress the LR features, respectively. The extracted shallow features are first passed through the super-resolution block (SRB) to obtain the deep features G_o^0 and G_u^0:

G_o^0 = f_SRB(F_o),  G_u^0 = f_SRB(F_u)

where f_SRB is the SRB operation. Next, the deep exposure features of the two sub-networks are deeply fused over several iterations. At each iteration, the feedback features of the previous iteration of both sub-networks are coupled with the shallow features of the respective network as the input of the current iteration; the feedback features G_o^t and G_u^t of the t-th iteration can be expressed as

G_o^t = f_CFAB(F_o, G_o^{t-1}, G_u^{t-1})
G_u^t = f_CFAB(F_u, G_u^{t-1}, G_o^{t-1})

where f_CFAB is the operation of the coupled feedback attention module. At the first iteration (t = 1), the inputs G_o^0 and G_u^0 are the outputs of the SRB; the attention-refined features C_o^t and C_u^t used for reconstruction are described in Section 2.2.
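The coupling pattern above can be sketched in a few lines; this is a toy NumPy sketch in which each network block is replaced by a fixed random linear map over the channel dimension (all names, shapes, and weights here are illustrative stand-ins, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 8  # feature channels (toy value)

def feb(x):   # stand-in for the shallow feature extraction f_FEB
    return x @ W_feb

def srb(f):   # stand-in for the super-resolution block f_SRB
    return f @ W_srb

def cfab(f_in, g_self, g_other):
    # stand-in for f_CFAB: concatenate own shallow features with
    # BOTH branches' previous feedback features, then project back to m channels
    return np.concatenate([f_in, g_self, g_other], axis=-1) @ W_cfab

W_feb = rng.standard_normal((m, m)) * 0.1
W_srb = rng.standard_normal((m, m)) * 0.1
W_cfab = rng.standard_normal((3 * m, m)) * 0.1

I_o = rng.standard_normal((16, m))   # toy over-exposed LR features
I_u = rng.standard_normal((16, m))   # toy under-exposed LR features

F_o, F_u = feb(I_o), feb(I_u)
G_o, G_u = srb(F_o), srb(F_u)        # t = 0: SRB outputs seed the iteration
T = 4
for t in range(1, T + 1):
    # each branch couples the OTHER branch's previous feedback features;
    # tuple assignment updates both branches simultaneously
    G_o, G_u = cfab(F_o, G_o, G_u), cfab(F_u, G_u, G_o)
print(G_o.shape, G_u.shape)          # both remain (16, 8)
```

The point of the sketch is only the data flow: each iteration consumes the shallow features of its own branch plus the previous feedback features of both branches, matching the equations above.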
Finally, at each iteration, the output of the coupled feedback attention module, i.e., the super-resolution features after the channel attention module, is reconstructed by the reconstruction module REC to obtain an SR residual image, which is then summed with the upsampled corresponding LR image to produce the SR image:

I_o^{sr,t} = f_REC(C_o^t) + f_UP(I_o^lr)
I_u^{sr,t} = f_REC(C_u^t) + f_UP(I_u^lr)

where f_REC is the reconstruction operation and f_UP the upsampling operation.

Figure 1 Coupled feedback attention network (each sub-network contains a feature extraction block FEB, an SRB, T coupled feedback blocks CFB with channel attention CA, and a reconstruction module REC; the two branch outputs are combined by a weighted sum into the fused image)

2.2 Coupled Feedback Attention Module
This section describes the iterative process of the coupled feedback block and the channel attention module. As shown in Fig. 2, the coupled feedback attention structure mainly consists of the iterated convolutional and deconvolutional layers that form the CFB, plus channel attention gates. Following Section 2.1, in the upper sub-network the inputs of the coupled feedback attention module at iteration t are G_o^{t-1}, G_u^{t-1}, and F_o. First, channel compression is performed through the convolutional layer Conv(1, m) to obtain the input L^t(0) of the module:

L^t(0) = f_in([G_o^{t-1}, G_u^{t-1}, F_o])

Next, the features pass through multiple working groups consisting of deconvolutional and convolutional layers. The HR feature H^t(n) of the n-th working group in the t-th iteration can be expressed as

H^t(n) = f_up([L^t(0), L^t(1), ..., L^t(n-1)])

where f_up is the deconvolution layer Deconv(3, m). The HR features are thus generated by jointly upsampling the LR features of the first n-1 working groups.
Similarly, the LR feature L^t(n) can be expressed as

L^t(n) = f_down([H^t(1), H^t(2), ..., H^t(n)])

where f_down is the convolutional layer Conv(3, m). The output of the final, N-th working group is generated from the joint LR features of all N working groups through the convolution layer Conv(1, m):

G^t = f_out([L^t(1), L^t(2), ..., L^t(N)])

The above describes the iterative process of the extremely over-exposed branch; the iterative process of the extremely under-exposed branch is the same. The feedback features G_o^t and G_u^t output at each iteration then go through the channel attention module (CA) for feature optimization. The CA in this paper consists of three steps: global information compression, squeeze and excitation, and recalibration.

1) Global information compression. To obtain the global information of each channel, this paper represents each channel by its global average pooling value:

g_o = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} G_o^t(i, j)
g_u = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} G_u^t(i, j)

where G_o^t(i, j) and G_u^t(i, j) are the values at each spatial position of the output extreme-exposure features; pooling compresses the channels into a one-dimensional feature vector.

2) Squeeze and excitation. To more fully explore the dependencies between individual channels, the paper introduces a gate mechanism that learns the nonlinear mapping between channels, using a sigmoid activation to avoid the formation of adversarial relationships between channels:

s_o = σ(W_2 δ(W_1 g_o)),  s_u = σ(W_2 δ(W_1 g_u))

where W_1 and W_2 are convolutional layer weights, δ is the ReLU activation, and σ the sigmoid.
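Steps 1) and 2) amount to a squeeze-and-excitation style gate. A minimal NumPy sketch (the feature shapes, the channel-reduction ratio r, and the use of ReLU for δ are assumptions for illustration):

```python
import numpy as np

def channel_attention(G, W1, W2):
    """Squeeze-and-excitation gate: one weight in (0, 1) per channel."""
    g = G.mean(axis=(1, 2))                 # global average pooling -> (C,)
    z = np.maximum(W1 @ g, 0.0)             # squeeze: 1x1 conv ~ matmul, ReLU (δ)
    s = 1.0 / (1.0 + np.exp(-(W2 @ z)))     # excitation: sigmoid (σ)
    return s

rng = np.random.default_rng(1)
C, r = 16, 4                                # channels and reduction ratio (assumed)
G = rng.standard_normal((C, 12, 12))        # toy feature maps G^t
W1 = rng.standard_normal((C // r, C)) * 0.1 # channel reduction weights
W2 = rng.standard_normal((C, C // r)) * 0.1 # channel restoration weights
s = channel_attention(G, W1, W2)
print(s.shape)                              # (16,)
```

Each entry of s scales one channel of G^t, which is exactly the weight used in the recalibration step below.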
3) Recalibration. The individual channels of the input features are scaled by the channel attention weights just learned, thus enhancing useful features and suppressing useless ones:

C_o^t = G_o^t × (s_o^t + 1),                          t = 1
C_o^t = G_o^{t-1} × s_o^{t-1} + G_o^t × (s_o^t + 1),  t > 1

C_u^t = G_u^t × (s_u^t + 1),                          t = 1
C_u^t = G_u^{t-1} × s_u^{t-1} + G_u^t × (s_u^t + 1),  t > 1

where s_o^{t-1} and s_u^{t-1} are the channel attention weights of the previous iteration.

Figure 2 Coupled feedback attention structure (each iteration's CFB is followed by a channel gate, and each iteration produces an SR fused image from the LR coupled features)

2.3 Loss Function
The method in this paper performs image super-resolution and image multi-exposure fusion jointly, so the model is optimized with a hierarchical loss function:

L = λ_o L_1(I_o^sr, I_o^hr) + λ_u L_1(I_u^sr, I_u^hr) + Σ_{t=1}^{T} λ_t (L_1(I_o^{f,t}, I_f^hr) + L_1(I_u^{f,t}, I_f^hr))

where I_o^hr and I_u^hr are the HR reference images of the two extreme exposures, I_f^hr is the HDR, HR reference image that the final fused image should match, and I_o^{f,t}, I_u^{f,t} are the fused SR images produced by the two branches at the t-th iteration. λ_o, λ_u, and {λ_t} are the weight coefficients of the respective loss terms; in this paper we set λ_o = λ_u = λ_t = 1.

3 Experiment and Analysis
3.1 Experimental Setup
1) Experimental setup. The model was trained on a GeForce GTX 1070 Ti. The experiments mainly use the SICE [5] dataset, which contains 589 high-quality reference images and their corresponding multi-exposure sequences; only the extremely exposed image pairs are used in this paper.
2) Comparison methods. Since the proposed network performs both image super-resolution and image exposure fusion, we combine current image super-resolution methods with image exposure fusion methods as comparison methods.
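Before presenting the comparisons, the hierarchical loss of Section 2.3 can be checked numerically with a small sketch (toy NumPy tensors; the per-iteration fused outputs and all shapes are stand-ins, not the trained model's outputs):

```python
import numpy as np

def l1(a, b):
    """Mean absolute error, the L1 loss used for each term."""
    return np.abs(a - b).mean()

rng = np.random.default_rng(2)
H, W, T = 8, 8, 4
I_o_hr, I_u_hr, I_f_hr = (rng.random((H, W)) for _ in range(3))  # HR references
I_o_sr, I_u_sr = rng.random((H, W)), rng.random((H, W))          # branch SR outputs
# toy fused SR outputs (I_o^{f,t}, I_u^{f,t}) for iterations t = 1..T
fused = [(rng.random((H, W)), rng.random((H, W))) for _ in range(T)]

lam_o = lam_u = 1.0
lam_t = [1.0] * T            # all loss weights set to 1, as in the paper
loss = (lam_o * l1(I_o_sr, I_o_hr)
        + lam_u * l1(I_u_sr, I_u_hr)
        + sum(lam * (l1(f_o, I_f_hr) + l1(f_u, I_f_hr))
              for lam, (f_o, f_u) in zip(lam_t, fused)))
print(loss)
```

The per-iteration terms are what make the loss "hierarchical": every iteration's fused output is pulled toward the HDR, HR reference, not just the final one.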
The image super-resolution methods are DBPN[3], RCAN[4], SRFBN[2], and SwinIR[9]; the image exposure fusion methods are MGFF[10], FAST SPD-MEF[6], MEF-Net[8], and U2Fusion[7]. We combined the SR and MEF methods in both orders, i.e., SR+MEF and MEF+SR, to generate 32 comparison methods. CF-Net[1] was also selected for comparison.

3.2 Objective evaluation
To verify the effectiveness of the proposed method at a magnification factor of 2, we evaluate on the SICE dataset and compare against the combined methods above. Table 1 shows the results of our method and the comparison methods at a magnification factor of 2 under three metrics. From Table 1, the proposed method achieves the best fusion results, ranking first among all 34 methods in every metric: compared with the second-place CF-Net, PSNR is improved by 0.25 dB, SSIM by 0.0028, and MEF-SSIM by 0.0005. Table 1.
Comparison of the fusion results at a magnification factor of 2

Super-Resolution + Image Fusion (SR+MEF)
                 | MGFF[10]                | FAST SPD-MEF[6]         | MEF-Net[8]              | U2Fusion[7]
                 | PSNR    SSIM   MEF-SSIM | PSNR    SSIM   MEF-SSIM | PSNR    SSIM   MEF-SSIM | PSNR    SSIM   MEF-SSIM
DBPN[3]          | 17.47dB 0.7434 0.9121   | 17.30dB 0.7615 0.8976   | 17.26dB 0.7660 0.8888   | 17.83dB 0.7423 0.8807
RCAN[4]          | 17.39dB 0.7406 0.9114   | 17.34dB 0.7618 0.8974   | 17.24dB 0.7653 0.8882   | 17.85dB 0.7409 0.8804
SRFBN[2]         | 17.48dB 0.7425 0.9130   | 17.34dB 0.7601 0.8983   | 17.29dB 0.7641 0.8895   | 17.84dB 0.7402 0.8811
SwinIR[9]        | 17.44dB 0.7436 0.9113   | 17.26dB 0.7618 0.8968   | 17.23dB 0.7667 0.8881   | 17.82dB 0.7436 0.8802

Image Fusion + Super-Resolution (MEF+SR)
                 | DBPN[3]                 | RCAN[4]                 | SRFBN[2]                | SwinIR[9]
                 | PSNR    SSIM   MEF-SSIM | PSNR    SSIM   MEF-SSIM | PSNR    SSIM   MEF-SSIM | PSNR    SSIM   MEF-SSIM
MGFF[10]         | 17.27dB 0.7161 0.9144   | 17.18dB 0.7122 0.9135   | 17.38dB 0.7218 0.9158   | 17.19dB 0.7135 0.9131
Fast SPD-MEF[6]  | 17.26dB 0.7554 0.8954   | 17.24dB 0.7533 0.8949   | 17.31dB 0.7557 0.8962   | 17.21dB 0.7546 0.8944
MEF-Net[8]       | 17.25dB 0.7636 0.8886   | 17.23dB 0.7624 0.8882   | 17.27dB 0.7630 0.8892   | 17.20dB 0.7629 0.8878
U2Fusion[7]      | 17.81dB 0.7384 0.8843   | 17.82dB 0.7368 0.8837   | 17.85dB 0.7395 0.8850   | 17.76dB 0.7374 0.8835

CF-Net[1]        | PSNR = 21.24dB   SSIM = 0.8140   MEF-SSIM = 0.9332
Ours             | PSNR = 21.49dB   SSIM = 0.8168   MEF-SSIM = 0.9337

3.3 Subjective evaluation
Fig. 3 visually compares the fused images produced by our method and other advanced methods at a magnification factor of 2. The results show that, compared with the SR+MEF and MEF+SR methods, our method greatly improves the details; compared with the coupled feedback network (CF-Net), it alleviates the redundant information introduced into the image by the coupled feedback mechanism.
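For reference, the PSNR figures reported in Table 1 follow the standard definition; a minimal NumPy sketch (toy images assumed to be scaled to [0, 1]):

```python
import numpy as np

def psnr(ref, test, peak=1.0):
    """Peak signal-to-noise ratio in dB for images in [0, peak]."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

rng = np.random.default_rng(3)
ref = rng.random((32, 32))                      # toy reference image
noisy = np.clip(ref + 0.01 * rng.standard_normal(ref.shape), 0.0, 1.0)
print(f"{psnr(ref, noisy):.2f} dB")             # small noise -> high PSNR
```

A gap such as the 0.25 dB between our method and CF-Net corresponds directly to a lower mean squared error against the HDR, HR reference.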
Figure 3 Comparison of different methods on "landscape" at 2× magnification. Panels: (a) over-exposed input; (b) under-exposed input; (c) DBPN+Fast SPD-MEF; (d) Fast SPD-MEF+RCAN; (e) MEF-Net+DBPN; (f) MGFF+SRFBN; (g) RCAN+U2Fusion; (h) SRFBN+MGFF; (i) SwinIR+MEF-Net; (j) U2Fusion+RCAN; (k) CF-Net; (l) ours.

4 Conclusion
Based on the powerful image reconstruction property of the feedback mechanism and the ability of the channel attention mechanism to distinguish the importance of features, this paper proposes a coupled feedback attention network that solves the image super-resolution and image exposure fusion problems simultaneously. The experimental results show that the algorithm retains the detailed information of edges, region boundaries, and textures of the original image sequence.

5 References
[1] Deng X., Zhang Y. T., Xu M., et al. Deep coupled feedback network for joint exposure fusion and image super-resolution[J]. IEEE Transactions on Image Processing. 2021, 30: 3098-3112.
[2] Li Z., Yang J. L., Liu Z., et al. Feedback network for image super-resolution[C]. IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.
[3] Haris M., Shakhnarovich G., Ukita N. Deep back-projection networks for single image super-resolution[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2021, 43(12): 4323-4337.
[4] Zhang Y. L., Li K. P., Li K., et al. Image super-resolution using very deep residual channel attention networks[C]. European Conference on Computer Vision. 2018.
[5] Cai J. R., Gu S. H. & Zhang L. Learning a deep single image contrast enhancer from multi-exposure images[J]. IEEE Transactions on Image Processing. 2018, 27(4): 2049-2062.
[6] Li H., Ma K. D., Yong H. W., et al. Fast multi-scale structural patch decomposition for multi-exposure image fusion[J]. IEEE Transactions on Image Processing. 2020, 29: 5805-5816.
[7] Xu H., Ma J. Y., Jiang J. J., et al. U2Fusion: A unified unsupervised image fusion network[J].
IEEE Transactions on Pattern Analysis and Machine Intelligence. 2022, 44(1): 502-518.
[8] Ma K., Duanmu Z., Zhu H., Fang Y., et al. Deep guided learning for fast multi-exposure image fusion[J]. IEEE Transactions on Image Processing. 2020, 29: 2808-2819.
[9] Liang J. Y., Cao J. Z., Sun G. L., et al. SwinIR: Image restoration using Swin Transformer[C]. IEEE/CVF International Conference on Computer Vision Workshops. 2021.
[10] Bavirisetti D. P., Xiao G., Zhao J. H., et al. Multi-scale guided image and video fusion: A fast and efficient approach[J]. Circuits Systems and Signal Processing. 2019, 38(12): 5576-5605.