Attention U-Net Based Adversarial Architectures for Chest X-ray Lung Segmentation

Gusztáv Gaál¹, Balázs Maga² and András Lukács³

¹ Eötvös Loránd University, Hungary, email: guzzzti@gmail.com
² Eötvös Loránd University, Hungary, email: mbalazs0701@gmail.com
³ Eötvös Loránd University, Hungary, email: lukacs@cs.elte.hu

Copyright ©2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. X-ray is by far the most common among medical imaging modalities, being faster, more accessible, and more cost-effective compared to other radiographic methods. Chest X-ray (CXR) is the most commonly requested test due to its contribution to the early detection of lung cancer. The most important biomarkers in detecting lung cancer are nodules, and in finding those, lung segmentation of chest X-rays is essential. Another goal is interpretability, helping radiologists integrate computer-aided detection methods into their diagnostic pipeline and greatly reducing their workload. For this reason, a robust algorithm to perform this otherwise arduous segmentation task is much desired in the field of medical imaging. In this work, we present a novel deep learning approach that uses state-of-the-art fully convolutional neural networks in conjunction with an adversarial critic model. Our network generalized well to CXR images of unseen datasets with different patient profiles, achieving a final DSC of 97.5% on the JSRT CXR dataset.

1 INTRODUCTION

X-ray is the most commonly performed radiographic examination, being significantly easier to access, cheaper and faster to carry out than computed tomography (CT), diagnostic ultrasound and magnetic resonance imaging (MRI), as well as having a lower dose of radiation compared to a CT scan. According to the publicly available, official data of the National Health Service ([2]), in the period from February 2017 to February 2018, the count of imaging activity was about 41 million in England, out of which almost 22 million was plain X-ray. Many of these imaging tests might contribute to the early diagnosis of cancer, amongst which chest X-ray is the one most commonly requested by general practitioners. In order to identify lung nodules, lung segmentation of chest X-rays is essential, and this step is vital in other diagnostic pipelines as well, such as calculating the cardiothoracic ratio, which is the primary indicator of cardiomegaly. For this reason, a robust algorithm to perform this otherwise arduous segmentation task is much desired in the field of medical imaging.

Semantic segmentation aims to solve the challenging problem of assigning a pre-defined class to each pixel of the image. This task requires a high level of visual understanding, in which state-of-the-art performance is attained by methods utilizing Fully Convolutional Networks (FCN) [7]. In [8], adversarial training is used to enhance the segmentation of colored images. This idea was incorporated in [13] in order to segment chest X-rays with a fully convolutional, residual neural network. Recently, Mask R-CNN [4] has been utilized to perform instance segmentation on chest X-rays, obtaining state-of-the-art results [12, 5].

2 DEEP LEARNING APPROACH

2.1 Network Architecture

Our goal is to produce accurate organ segmentation masks on chest X-rays, meaning that for input images we want pixel-wise dense predictions regarding whether the given pixel is part of the left lung, the right lung, the heart, or none of the above.

For this purpose, Fully Convolutional Networks (FCNs) are known to significantly outperform other widely used registration-based methods. Specifically, we applied a U-Net architecture, enabling us to efficiently compute the segmentation mask in the same resolution as the input images. The fully convolutional architecture also enables the use of images of different resolutions, since unlike standard convolutional networks, FCNs do not contain input-size dependent layers.

In [9] it has been shown that for medical image analysis tasks the integration of the proposed Attention Gates (AGs) improved the accuracy of segmentation models while preserving computational efficiency. The architecture of the proposed Attention U-Net is described by Figure 1. Without the use of AGs, it is common practice to use cascade CNNs, selecting a Region Of Interest (ROI) with another CNN where the target organ is likely contained. With the use of AGs we eliminate the need for such a preselecting network; instead, the Attention U-Net learns to focus on the most important local features and dulls down the less relevant ones. We note that the dulling of less relevant local features also results in decreased false positive rates.

Figure 1. Schematic architecture of the Attention U-Net [9]
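To make the gating mechanism concrete, the following Keras sketch shows an additive attention gate of the kind proposed in [9]. It is a minimal illustration, assuming the gating signal has already been resampled to the resolution of the skip connection; the function name, filter counts and exact wiring are our own illustrative choices, not the implementation used in our experiments.

```python
from tensorflow.keras import layers

def attention_gate(x, g, inter_channels):
    """Additive attention gate in the spirit of [9].

    x: encoder skip-connection features, shape (B, H, W, C_x).
    g: decoder gating signal, assumed already resampled to (B, H, W, C_g).
    Returns x rescaled by a learned attention map with values in [0, 1].
    """
    theta_x = layers.Conv2D(inter_channels, 1, use_bias=False)(x)  # project skip features
    phi_g = layers.Conv2D(inter_channels, 1, use_bias=False)(g)    # project gating signal
    f = layers.Activation("relu")(layers.Add()([theta_x, phi_g]))
    psi = layers.Conv2D(1, 1, activation="sigmoid")(f)             # per-pixel attention coefficient
    return layers.Multiply()([x, psi])                             # dull down less relevant features
```

The sigmoid attention map multiplies the skip-connection features before they are concatenated in the decoder, which is what suppresses responses from irrelevant regions.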
In order to enhance the performance of the Attention U-Net, we further experimented with adversarial techniques, motivated by [13]. In that work, the authors first designed a Fully Convolutional Network (FCN) for the lung segmentation task, and noted that in certain cases the network tends to segment abnormal and incorrect organ shapes. For example, the apex of the ribcage might be mistaken for an internal rib bone, resulting in the mask "bleeding out" into the background, which has a similar intensity to the lung field. To address this issue, they developed an adversarial scheme, leading to a model which they call the Structure Correcting Adversarial Network (SCAN). This architecture is based on the idea of Generative Adversarial Networks [3]. They use the pretrained Fully Convolutional Network as the generator of a Generative Adversarial Network, and they also train a critic network which is fed the ground truth mask, the predicted mask and, optionally, the original image. The critic network has roughly the same architecture as the generator, resulting in similar capacity. This approach forces the generator to segment more realistic masks, eventually removing obviously wrong shapes.

Figure 2. Schematic architecture of the Structure Correcting Adversarial Networks [13]

In our work, besides the standard Attention U-Net, we also created a network of analogous structure, in which the FCN used in [13] is replaced by the Attention U-Net. We did not introduce any modification in the critic model design; such experiments are left to future work.
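The input/output contract of the critic is simple: it receives an image together with a mask (ground truth or predicted) and outputs a scalar probability. The sketch below is only a schematic stand-in; the critic in [13] roughly mirrors the segmentation network, whereas the depth, filter counts and function name here are illustrative assumptions.

```python
from tensorflow.keras import layers, Model

def build_critic(height, width, n_classes):
    """Sketch of a critic D(x, y): image plus mask in, scalar probability out."""
    image = layers.Input((height, width, 1))
    mask = layers.Input((height, width, n_classes))
    h = layers.Concatenate()([image, mask])        # optionally condition on the original image
    for filters in (32, 64, 128):                  # illustrative depth and widths
        h = layers.Conv2D(filters, 3, strides=2, padding="same", activation="relu")(h)
    h = layers.GlobalAveragePooling2D()(h)
    p_real = layers.Dense(1, activation="sigmoid")(h)  # probability that the mask is ground truth
    return Model([image, mask], p_real)
```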
2.2 Tversky Loss

In the field of medical imaging, the Dice Score Coefficient (DSC) is probably the most widespread and simplest way to measure the overlap ratio of the masks and the ground truth, and hence to compare and evaluate segmentations. Given two sets of pixels X, Y, their DSC is

    DSC(X, Y) = \frac{2|X \cap Y|}{|X| + |Y|}.

If Y is in fact the result of a test about which pixels are in X, we can rewrite it with the usual notation of true/false positives (TP/FP) and false negatives (FN) as

    DSC(X, Y) = \frac{2TP}{2TP + FN + FP}.

We would like to use this concept in our setup. The class c we would like to segment corresponds to a set, but it is more appropriate to consider its indicator function g, that is, g_{i,c} \in \{0, 1\} equals 1 if and only if the i-th pixel belongs to the object. On the other hand, our prediction is a probability for each pixel, denoted by p_{i,c} \in [0, 1]. The Dice Score of the prediction for class c, in the spirit of the above description, is then defined to be

    DSC_c = \frac{\sum_{i=1}^{N} p_{i,c}\, g_{i,c} + \varepsilon}{\sum_{i=1}^{N} (p_{i,c} + g_{i,c}) + \varepsilon},

where N is the total number of pixels, and \varepsilon is introduced for the sake of numerical stability and to avoid division by 0. The linear Dice Loss (DL) of the multiclass prediction is then

    DL = \sum_{c} (1 - DSC_c).

A deficiency of the Dice Loss is that it penalizes false negative and false positive predictions equally, which results in high precision but low recall. Practice shows that if the regions of interest (ROI) are small, false negative pixels need to have a higher weight than false positive ones. Mathematically this obstacle is easily overcome by introducing weights \alpha, \beta as tunable parameters, resulting in the definition of the Tversky similarity index [11]:

    TI_c = \frac{\sum_{i=1}^{N} p_{i,c}\, g_{i,c} + \varepsilon}{\sum_{i=1}^{N} p_{i,c}\, g_{i,c} + \alpha \sum_{i=1}^{N} \bar{p}_{i,c}\, g_{i,c} + \beta \sum_{i=1}^{N} p_{i,c}\, \bar{g}_{i,c} + \varepsilon},

where \bar{p}_{i,c} = 1 - p_{i,c} and \bar{g}_{i,c} = 1 - g_{i,c}, that is, the overline simply stands for the complement of the class. The Tversky Loss is obtained from the Tversky index as the Dice Loss was obtained from the Dice Score Coefficient:

    TL = \sum_{c} (1 - TI_c).

Another issue with the Dice Loss is that it struggles to segment small ROIs, as they do not contribute to the loss significantly. This difficulty was addressed in [1], where the authors introduced the Focal Tversky Loss in order to improve the performance of their lesion segmentation model:

    FTL = \sum_{c} (1 - TI_c)^{1/\gamma},

where \gamma \in [1, 3]. In practice, if a pixel is misclassified with a high Tversky index, the Focal Tversky Loss is unaffected. However, if the Tversky index is small and the pixel is misclassified, the Focal Tversky Loss will decrease significantly.
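A direct TensorFlow transcription of the Focal Tversky Loss above is short; in this sketch \alpha weights false negatives and \beta false positives, as in the formula, but the concrete values of \alpha, \beta and \gamma are illustrative rather than the ones used in our training runs.

```python
import tensorflow as tf

def focal_tversky_loss(y_true, y_pred, alpha=0.7, beta=0.3, gamma=4.0 / 3.0, eps=1e-6):
    """Focal Tversky Loss of Section 2.2 for one-hot masks and softmax predictions.

    y_true, y_pred: tensors of shape (batch, H, W, C); hyperparameter values are illustrative.
    """
    axes = [0, 1, 2]                                         # aggregate over batch and spatial dims
    tp = tf.reduce_sum(y_true * y_pred, axis=axes)           # soft true positives per class
    fn = tf.reduce_sum(y_true * (1.0 - y_pred), axis=axes)   # soft false negatives per class
    fp = tf.reduce_sum((1.0 - y_true) * y_pred, axis=axes)   # soft false positives per class
    ti = (tp + eps) / (tp + alpha * fn + beta * fp + eps)    # Tversky index TI_c per class
    return tf.reduce_sum(tf.pow(1.0 - ti, 1.0 / gamma))      # FTL = sum_c (1 - TI_c)^(1/gamma)
```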
2.3 Training

The training of our structure correcting network takes a bit longer to explain; we directly follow in the footsteps of [13]. Let S and D be the segmentation network and the critic network, respectively. The data consist of the input images x_i and the associated mask labels y_i, where x_i is of shape [H, W, 1] for a single-channel gray-scale image with height H and width W, and y_i is of shape [H, W, C], where C is the number of classes including the background. Note that for each pixel location (j, k), y_{ijkc} = 1 for the labeled class channel c, while the rest of the channels are zero (y_{ijkc'} = 0 for c' \neq c). We use S(x) \in [0, 1]^{H \times W \times C} to denote the class probabilities predicted by S at each pixel location, such that the class probabilities normalize to 1 at each pixel. Let D(x_i, y) be the scalar probability estimate of y coming from the training data. They defined the optimization problem as

    \min_S \max_D J(S, D) := \sum_{i=1}^{N} \Big\{ J_s(S(x_i), y_i) - \lambda \big[ J_d(D(x_i, y_i), 1) + J_d(D(x_i, S(x_i)), 0) \big] \Big\},    (1)

where

    J_s(\hat{y}, y) := \frac{1}{HW} \sum_{j,k} \sum_{c=1}^{C} -y_{jkc} \ln \hat{y}_{jkc}

is the multiclass cross-entropy loss for the predicted mask \hat{y}, averaged over all pixels, and

    J_d(\hat{t}, t) := -\big( t \ln \hat{t} + (1 - t) \ln(1 - \hat{t}) \big)

is the binary logistic loss for the critic's prediction. \lambda is a tuning parameter balancing the pixel-wise loss and the adversarial loss. We can solve equation (1) by alternating between optimizing S and optimizing D using their respective loss functions. This is the point where we introduced a modification: instead of using the multiclass cross-entropy loss J_s(\hat{y}, y) in the first term, we applied the Focal Tversky Loss FTL(\hat{y}, y).

Since the first term in equation (1) does not depend on D, we can train our critic network by minimizing the following objective with respect to D for a fixed S:

    \sum_{i=1}^{N} \big[ J_d(D(x_i, y_i), 1) + J_d(D(x_i, S(x_i)), 0) \big].

Moreover, given a fixed D, we train the segmentation network by minimizing the following objective with respect to S:

    \sum_{i=1}^{N} \big[ FTL(S(x_i), y_i) + \lambda\, J_d(D(x_i, S(x_i)), 1) \big].

Following the recommendation in [3], we use J_d(D(x_i, S(x_i)), 1) in place of -J_d(D(x_i, S(x_i)), 0), as it leads to stronger gradient signals. After tests on the value of \lambda, we decided to use \lambda = 0.1.

Concerning the training schedule, we found that after pretraining the generator for 50 epochs, we can train the adversarial network for 50 epochs, in which we perform 1 optimization step on the critic network after every 5 optimization steps on the generator. This choice of balance is also borrowed from [13]; however, we note that the training of our network is much faster.
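The alternating optimization can be sketched as two TensorFlow training steps, assuming a segmentation model seg_net, a critic, the focal_tversky_loss above and a tf.data dataset of (image, one-hot mask) batches already exist; the learning rates are illustrative, while lam corresponds to the \lambda = 0.1 and the 5:1 step ratio to the schedule described above.

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()
seg_opt = tf.keras.optimizers.SGD(learning_rate=1e-2)      # illustrative learning rates
critic_opt = tf.keras.optimizers.SGD(learning_rate=1e-2)
lam = 0.1                                                   # lambda from Section 2.3

def generator_step(x, y):
    # minimize FTL(S(x), y) + lambda * J_d(D(x, S(x)), 1)  (non-saturating trick from [3])
    with tf.GradientTape() as tape:
        pred = seg_net(x, training=True)
        d_fake = critic([x, pred], training=False)
        loss = focal_tversky_loss(y, pred) + lam * bce(tf.ones_like(d_fake), d_fake)
    grads = tape.gradient(loss, seg_net.trainable_variables)
    seg_opt.apply_gradients(zip(grads, seg_net.trainable_variables))

def critic_step(x, y):
    # minimize J_d(D(x, y), 1) + J_d(D(x, S(x)), 0) for a fixed S
    with tf.GradientTape() as tape:
        pred = seg_net(x, training=False)
        d_real = critic([x, y], training=True)
        d_fake = critic([x, pred], training=True)
        loss = bce(tf.ones_like(d_real), d_real) + bce(tf.zeros_like(d_fake), d_fake)
    grads = tape.gradient(loss, critic.trainable_variables)
    critic_opt.apply_gradients(zip(grads, critic.trainable_variables))

# adversarial phase: one critic update after every 5 generator updates
for step, (x, y) in enumerate(dataset):
    generator_step(x, y)
    if (step + 1) % 5 == 0:
        critic_step(x, y)
```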
3 DATASETS

For training and validation data, we used the Japanese Society of Radiological Technology (JSRT) dataset [10], as well as the Montgomery and Shenzhen datasets [6], all of which are public datasets of chest X-rays with organ segmentation masks reviewed by expert radiologists.

The JSRT dataset contains a total of 247 images, of which 154 contain lung nodules. The X-rays are all of 2048 × 2048 resolution and have 12-bit grayscale levels. Both lung and heart segmentation masks are available for this dataset.

The Montgomery dataset contains 138 chest X-rays, of which 80 are from healthy patients and 58 are from patients with tuberculosis. The X-rays have a resolution of either 4020 × 4892 or 4892 × 4020, and have 12-bit grayscale levels as well. In the case of this dataset, only lung segmentation masks are publicly available.

The Shenzhen dataset contains a total of 662 chest X-rays, of which 326 are of healthy patients and, in a similar fashion, 336 are of patients with tuberculosis. The images vary in size, but all are of high resolution, with 8-bit grayscale levels. Only lung segmentation masks are publicly available for this dataset.

3.1 Preprocessing Data

X-rays are grayscale images with typically low contrast, which makes their analysis a difficult task. This obstacle might be overcome by using some sort of histogram equalization technique. The idea of standard histogram equalization is to spread out the most frequent intensity values over a wider range of the intensity domain [0, 255] by modifying the intensities so that their cumulative distribution function (CDF) on the complete modified image is as close to the CDF of the uniform distribution as possible. Improvements might be made by using adaptive histogram equalization, in which the above method is not applied globally, but separately on pieces of the image, in order to enhance local contrast. However, this technique might overamplify noise in near-constant regions, hence our choice was to use Contrast Limited Adaptive Histogram Equalization (CLAHE), which counteracts this effect by clipping the histogram at a predefined value before calculating the CDF, and redistributing the clipped part equally among all the histogram bins.

Applying CLAHE to an X-ray image has visually appealing results, as displayed in Figure 3. As our experiments showed, it does not merely help human vision, but also neural networks.

Figure 3. Example of chest X-ray images before and after CLAHE

The images were then resized to 512 × 512 resolution and their intensities mapped to [−1, 1] before being fed to our network.
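The preprocessing pipeline of this section fits into a few lines of OpenCV; the clip limit and tile grid size below are common illustrative defaults rather than values reported here, and the function name is our own.

```python
import cv2
import numpy as np

def preprocess(path, size=512, clip_limit=2.0, tile_grid=(8, 8)):
    """CLAHE, resizing and intensity mapping to [-1, 1], as in Section 3.1."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)             # load as 8-bit grayscale
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile_grid)
    img = clahe.apply(img)                                    # contrast-limited adaptive equalization
    img = cv2.resize(img, (size, size), interpolation=cv2.INTER_AREA)
    img = img.astype(np.float32) / 127.5 - 1.0                # map [0, 255] to [-1, 1]
    return img[..., np.newaxis]                               # add the channel axis the network expects
```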
4 EXPERIMENTS AND RESULTS

The aforementioned Attention U-Net architecture was implemented using the Keras-TensorFlow Python libraries, to which we fed our dataset and trained for 40 epochs with 8 X-ray scans in each batch. Our optimizer of choice was stochastic gradient descent, having found that Adam failed to converge in many cases. As loss function, we applied the Focal Tversky Loss.

We found that applying various data augmentation techniques, such as flipping, rotating and shearing the image, as well as increasing or decreasing its brightness, were of no help and just resulted in slower convergence.

Using the Attention U-Net infrastructure, we managed to reach a dice score of 0.9628 for the lungs. Unlike in [13], where no major preprocessing was done, with our preprocessing method the network performed very well even if the test and the validation sets were from different datasets. This is extremely important for real-world applications, as X-ray images from different machines are significantly different, largely depending on the specific calibration of each machine; thus it is no trivial task to have X-rays accurately evaluated that come from machines from which no images were in the training set.

Figure 4. Epoch-wise dice score coefficient

Table 1. Dice scores of different architectures over different datasets.

Dataset       SCAN          ATTN U-Net     Ours (Adv. ATTN)
JSRT          97.3 ±0.8%    96.3 ±0.7%     97.6 ±0.5%
All           -             95.8 ±0.4%     96.2 ±0.4%
All / JSRT    -             96.6 ±0.6%     97.8 ±0.6%

We note that even though introducing the adversarial scheme in our setting increased the dice scores, the improvement was not as drastic as in the case of the FCN and SCAN. By checking the masks generated by the vanilla Attention U-Net, we found that this phenomenon can be attributed to the fact that while the FCN occasionally produces abnormally shaped masks, due to our preprocessing steps the Attention U-Net does not commit this mistake. Consequently, the adversarial scheme is responsible for subtle shape improvements only, which the Dice Score reflects less spectacularly.

5 FUTURE WORK

So far we have not experimented with the architecture of the critic network, as we found the performance of the architecture in [13] completely satisfying. However, it would be desirable to carry out further tests in this direction in order to achieve a better understanding of the role of the adversarial scheme.

REFERENCES

[1] Nabila Abraham and Naimul Mefraz Khan, 'A novel focal Tversky loss function with improved attention U-Net for lesion segmentation', in 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), pp. 683–687. IEEE, (2019).
[2] NHS England and NHS Improvement, 'Diagnostic imaging dataset statistical release', (2019).
[3] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, 'Generative adversarial nets', in Advances in Neural Information Processing Systems, pp. 2672–2680, (2014).
[4] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick, 'Mask R-CNN', in Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969, (2017).
[5] Qinhua Hu, Luís Fabrício de F. Souza, Gabriel Bandeira Holanda, Shara S. A. Alves, Francisco Hércules dos S. Silva, Tao Han, and Pedro P. Rebouças Filho, 'An effective approach for CT lung segmentation using mask region-based convolutional neural networks', Artificial Intelligence in Medicine, 101792, (2020).
[6] Stefan Jaeger, Sema Candemir, Sameer Antani, Yì-Xiáng J. Wáng, Pu-Xuan Lu, and George Thoma, 'Two public chest X-ray datasets for computer-aided screening of pulmonary diseases', Quantitative Imaging in Medicine and Surgery, 4(6), 475, (2014).
[7] Jonathan Long, Evan Shelhamer, and Trevor Darrell, 'Fully convolutional networks for semantic segmentation', in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440, (2015).
[8] Pauline Luc, Camille Couprie, Soumith Chintala, and Jakob Verbeek, 'Semantic segmentation using adversarial networks', arXiv preprint arXiv:1611.08408, (2016).
[9] Ozan Oktay, Jo Schlemper, Loic Le Folgoc, Matthew Lee, Mattias Heinrich, Kazunari Misawa, Kensaku Mori, Steven McDonagh, Nils Y. Hammerla, Bernhard Kainz, et al., 'Attention U-Net: Learning where to look for the pancreas', arXiv preprint arXiv:1804.03999, (2018).
[10] Junji Shiraishi, Shigehiko Katsuragawa, Junpei Ikezoe, Tsuneo Matsumoto, Takeshi Kobayashi, Ken-ichi Komatsu, Mitate Matsui, Hiroshi Fujita, Yoshie Kodera, and Kunio Doi, 'Development of a digital image database for chest radiographs with and without a lung nodule: receiver operating characteristic analysis of radiologists' detection of pulmonary nodules', American Journal of Roentgenology, 174(1), 71–74, (2000).
[11] Amos Tversky, 'Features of similarity', Psychological Review, 84(4), 327, (1977).
[12] Jie Wang, Zhigang Li, Rui Jiang, and Zhen Xie, 'Instance segmentation of anatomical structures in chest radiographs', in 2019 IEEE 32nd International Symposium on Computer-Based Medical Systems (CBMS), pp. 441–446. IEEE, (2019).
[13] Wei Dai, Nanqing Dong, Zeya Wang, Xiaodan Liang, Hao Zhang, and Eric P. Xing, 'SCAN: Structure correcting adversarial network for organ segmentation in chest X-rays', in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 20, 2018, Proceedings, volume 11045, p. 263. Springer, (2018).