Attention U-Net Based Adversarial Architectures for Chest X-ray Lung Segmentation

Gusztáv Gaál¹, Balázs Maga² and András Lukács³

¹ Eötvös Loránd University, Hungary, email: guzzzti@gmail.com
² Eötvös Loránd University, Hungary, email: mbalazs0701@gmail.com
³ Eötvös Loránd University, Hungary, email: lukacs@cs.elte.hu

Copyright ©2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. X-ray is by far the most common among medical imaging modalities, being faster, more accessible, and more cost-effective compared to other radiographic methods. Chest X-ray (CXR) is the most commonly requested test due to its contribution to the early detection of lung cancer. The most important biomarkers in detecting lung cancer are nodules, and in finding those, lung segmentation of chest X-rays is essential. Another goal is interpretability, helping radiologists integrate computer-aided detection methods into their diagnostic pipeline and greatly reducing their workload. For this reason, a robust algorithm to perform this otherwise arduous segmentation task is much desired in the field of medical imaging. In this work, we present a novel deep learning approach that uses state-of-the-art fully convolutional neural networks in conjunction with an adversarial critic model. Our network generalized well to CXR images of unseen datasets with different patient profiles, achieving a final DSC of 97.5% on the JSRT CXR dataset.

1 INTRODUCTION

X-ray is the most commonly performed radiographic examination, being significantly easier to access, cheaper and faster to carry out than computed tomography (CT), diagnostic ultrasound and magnetic resonance imaging (MRI), as well as having a lower dose of radiation compared to a CT scan. According to the publicly available, official data of the National Health Service ([2]), in the period from February 2017 to February 2018, the count of imaging activity was about 41 million in England, out of which almost 22 million was plain X-ray. Many of these imaging tests might contribute to the early diagnosis of cancer, amongst which chest X-ray is the one most commonly requested by general practitioners. In order to identify lung nodules, lung segmentation of chest X-rays is essential, and this step is vital in other diagnostic pipelines as well, such as calculating the cardiothoracic ratio, which is the primary indicator of cardiomegaly. For this reason, a robust algorithm to perform this otherwise arduous segmentation task is much desired in the field of medical imaging.

Semantic segmentation aims to solve the challenging problem of assigning a pre-defined class to each pixel of the image. This task requires a high level of visual understanding, in which state-of-the-art performance is attained by methods utilizing Fully Convolutional Networks (FCN) [7]. In [8], adversarial training is used to enhance the segmentation of colored images. This idea was incorporated in [13] in order to segment chest X-rays with a fully convolutional, residual neural network. Recently, Mask R-CNN [4] has been utilized to perform instance segmentation on chest X-rays, obtaining state-of-the-art results [12, 5].

2 DEEP LEARNING APPROACH

2.1 Network Architecture

Our goal is to produce accurate organ segmentation masks on chest X-rays, meaning that for input images we want pixel-wise dense predictions regarding whether the given pixel is part of the left lung, the right lung, the heart, or none of the above.

For this purpose, Fully Convolutional Networks (FCNs) are known to significantly outperform other widely used registration-based methods. Specifically, we applied a U-Net architecture, enabling us to efficiently compute the segmentation mask in the same resolution as the input images. The fully convolutional architecture also enables the use of images of different resolutions, since unlike standard convolutional networks, FCNs do not contain input-size dependent layers.

In [9] it has been shown that for medical image analysis tasks the integration of the proposed Attention Gates (AGs) improved the accuracy of segmentation models while preserving computational efficiency. The architecture of the proposed Attention U-Net is described by Figure 1. Without the use of AGs, it is common practice to use cascade CNNs, selecting a Region Of Interest (ROI) with another CNN where the target organ is likely contained. With the use of AGs we eliminate the need for such a preselecting network; instead, the Attention U-Net learns to focus on the most important local features and dulls down the less relevant ones. We note that the dulling of less relevant local features also results in decreased false positive rates.

Figure 1. Schematic architecture of the Attention U-Net [9]
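To make the gating mechanism concrete, the following Keras sketch shows an additive attention gate of the kind proposed in [9]. It is a minimal illustration, assuming the gating signal has already been resampled to the resolution of the skip connection; the function name, filter counts and exact wiring are our own illustrative choices, not the implementation used in our experiments.

```python
from tensorflow.keras import layers

def attention_gate(x, g, inter_channels):
    """Additive attention gate in the spirit of [9].

    x: encoder skip-connection features, shape (B, H, W, C_x).
    g: decoder gating signal, assumed already resampled to (B, H, W, C_g).
    Returns x rescaled by a learned attention map with values in [0, 1].
    """
    theta_x = layers.Conv2D(inter_channels, 1, use_bias=False)(x)  # project skip features
    phi_g = layers.Conv2D(inter_channels, 1, use_bias=False)(g)    # project gating signal
    f = layers.Activation("relu")(layers.Add()([theta_x, phi_g]))
    psi = layers.Conv2D(1, 1, activation="sigmoid")(f)             # per-pixel attention coefficient
    return layers.Multiply()([x, psi])                             # dull down less relevant features
```

The sigmoid attention map multiplies the skip-connection features before they are concatenated in the decoder, which is what suppresses responses from irrelevant regions.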
In order to enhance the performance of the Attention U-Net, we further experimented with adversarial techniques, motivated by [13]. In that work, the authors first designed a Fully Convolutional Network (FCN) for the lung segmentation task, and noted that in certain cases the network tends to segment abnormal and incorrect organ shapes. For example, the apex of the ribcage might be mistaken for an internal rib bone, resulting in the mask "bleeding out" into the background, which has a similar intensity to the lung field. To address this issue, they developed an adversarial scheme, leading to a model which they call the Structure Correcting Adversarial Network (SCAN). This architecture is based on the idea of Generative Adversarial Networks [3]. They use the pretrained Fully Convolutional Network as the generator of a Generative Adversarial Network, and they also train a critic network which is fed the ground truth mask, the predicted mask and, optionally, the original image. The critic network has roughly the same architecture as the generator, resulting in similar capacity. This approach forces the generator to segment more realistic masks, eventually removing obviously wrong shapes.

Figure 2. Schematic architecture of the Structure Correcting Adversarial Networks [13]

In our work, besides the standard Attention U-Net, we also created a network of analogous structure, in which the FCN used in [13] is replaced by the Attention U-Net. We did not introduce any modification in the critic model design; such experiments are left to future work.
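The input/output contract of the critic is simple: it receives an image together with a mask (ground truth or predicted) and outputs a scalar probability. The sketch below is only a schematic stand-in; the critic in [13] roughly mirrors the segmentation network, whereas the depth, filter counts and function name here are illustrative assumptions.

```python
from tensorflow.keras import layers, Model

def build_critic(height, width, n_classes):
    """Sketch of a critic D(x, y): image plus mask in, scalar probability out."""
    image = layers.Input((height, width, 1))
    mask = layers.Input((height, width, n_classes))
    h = layers.Concatenate()([image, mask])        # optionally condition on the original image
    for filters in (32, 64, 128):                  # illustrative depth and widths
        h = layers.Conv2D(filters, 3, strides=2, padding="same", activation="relu")(h)
    h = layers.GlobalAveragePooling2D()(h)
    p_real = layers.Dense(1, activation="sigmoid")(h)  # probability that the mask is ground truth
    return Model([image, mask], p_real)
```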
2.2 Tversky Loss

In the field of medical imaging, the Dice Score Coefficient (DSC) is probably the most widespread and simplest way to measure the overlap ratio of the masks and the ground truth, and hence to compare and evaluate segmentations. Given two sets of pixels X, Y, their DSC is

    DSC(X, Y) = \frac{2|X \cap Y|}{|X| + |Y|}.

If Y is in fact the result of a test about which pixels are in X, we can rewrite it with the usual notation of true/false positives (TP/FP) and false negatives (FN) as

    DSC(X, Y) = \frac{2TP}{2TP + FN + FP}.

We would like to use this concept in our setup. The class c we would like to segment corresponds to a set, but it is more appropriate to consider its indicator function g, that is, g_{i,c} \in \{0, 1\} equals 1 if and only if the i-th pixel belongs to the object. On the other hand, our prediction is a probability for each pixel, denoted by p_{i,c} \in [0, 1]. The Dice Score of the prediction for class c, in the spirit of the above description, is then defined to be

    DSC_c = \frac{\sum_{i=1}^{N} p_{i,c}\, g_{i,c} + \varepsilon}{\sum_{i=1}^{N} (p_{i,c} + g_{i,c}) + \varepsilon},

where N is the total number of pixels, and \varepsilon is introduced for the sake of numerical stability and to avoid division by 0. The linear Dice Loss (DL) of the multiclass prediction is then

    DL = \sum_{c} (1 - DSC_c).

A deficiency of the Dice Loss is that it penalizes false negative and false positive predictions equally, which results in high precision but low recall. Practice shows that if the regions of interest (ROI) are small, false negative pixels need to have a higher weight than false positive ones. Mathematically this obstacle is easily overcome by introducing weights \alpha, \beta as tunable parameters, resulting in the definition of the Tversky similarity index [11]:

    TI_c = \frac{\sum_{i=1}^{N} p_{i,c}\, g_{i,c} + \varepsilon}{\sum_{i=1}^{N} p_{i,c}\, g_{i,c} + \alpha \sum_{i=1}^{N} \bar{p}_{i,c}\, g_{i,c} + \beta \sum_{i=1}^{N} p_{i,c}\, \bar{g}_{i,c} + \varepsilon},

where \bar{p}_{i,c} = 1 - p_{i,c} and \bar{g}_{i,c} = 1 - g_{i,c}, that is, the overline simply stands for the complement of the class. The Tversky Loss is obtained from the Tversky index as the Dice Loss was obtained from the Dice Score Coefficient:

    TL = \sum_{c} (1 - TI_c).

Another issue with the Dice Loss is that it struggles to segment small ROIs, as they do not contribute to the loss significantly. This difficulty was addressed in [1], where the authors introduced the Focal Tversky Loss in order to improve the performance of their lesion segmentation model:

    FTL = \sum_{c} (1 - TI_c)^{1/\gamma},

where \gamma \in [1, 3]. In practice, if a pixel is misclassified with a high Tversky index, the Focal Tversky Loss is unaffected. However, if the Tversky index is small and the pixel is misclassified, the Focal Tversky Loss will decrease significantly.
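A direct TensorFlow transcription of the Focal Tversky Loss above is short; in this sketch \alpha weights false negatives and \beta false positives, as in the formula, but the concrete values of \alpha, \beta and \gamma are illustrative rather than the ones used in our training runs.

```python
import tensorflow as tf

def focal_tversky_loss(y_true, y_pred, alpha=0.7, beta=0.3, gamma=4.0 / 3.0, eps=1e-6):
    """Focal Tversky Loss of Section 2.2 for one-hot masks and softmax predictions.

    y_true, y_pred: tensors of shape (batch, H, W, C); hyperparameter values are illustrative.
    """
    axes = [0, 1, 2]                                         # aggregate over batch and spatial dims
    tp = tf.reduce_sum(y_true * y_pred, axis=axes)           # soft true positives per class
    fn = tf.reduce_sum(y_true * (1.0 - y_pred), axis=axes)   # soft false negatives per class
    fp = tf.reduce_sum((1.0 - y_true) * y_pred, axis=axes)   # soft false positives per class
    ti = (tp + eps) / (tp + alpha * fn + beta * fp + eps)    # Tversky index TI_c per class
    return tf.reduce_sum(tf.pow(1.0 - ti, 1.0 / gamma))      # FTL = sum_c (1 - TI_c)^(1/gamma)
```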
2.3 Training

The training of our structure correcting network takes a bit longer to explain; we directly follow in the footsteps of [13]. Let S and D be the segmentation network and the critic network, respectively. The data consist of the input images x_i and the associated mask labels y_i, where x_i is of shape [H, W, 1] for a single-channel gray-scale image with height H and width W, and y_i is of shape [H, W, C], where C is the number of classes including the background. Note that for each pixel location (j, k), y_{ijkc} = 1 for the labeled class channel c, while the rest of the channels are zero (y_{ijkc'} = 0 for c' \neq c). We use S(x) \in [0, 1]^{H \times W \times C} to denote the class probabilities predicted by S at each pixel location, such that the class probabilities normalize to 1 at each pixel. Let D(x_i, y) be the scalar probability estimate of y coming from the training data. They defined the optimization problem as

    \min_S \max_D J(S, D) := \sum_{i=1}^{N} \Big\{ J_s(S(x_i), y_i) - \lambda \big[ J_d(D(x_i, y_i), 1) + J_d(D(x_i, S(x_i)), 0) \big] \Big\},    (1)

where

    J_s(\hat{y}, y) := \frac{1}{HW} \sum_{j,k} \sum_{c=1}^{C} -y_{jkc} \ln \hat{y}_{jkc}

is the multiclass cross-entropy loss for the predicted mask \hat{y}, averaged over all pixels, and

    J_d(\hat{t}, t) := -\big( t \ln \hat{t} + (1 - t) \ln(1 - \hat{t}) \big)

is the binary logistic loss for the critic's prediction. \lambda is a tuning parameter balancing the pixel-wise loss and the adversarial loss. We can solve equation (1) by alternating between optimizing S and optimizing D using their respective loss functions. This is the point where we introduced a modification: instead of using the multiclass cross-entropy loss J_s(\hat{y}, y) in the first term, we applied the Focal Tversky Loss FTL(\hat{y}, y).

Since the first term in equation (1) does not depend on D, we can train our critic network by minimizing the following objective with respect to D for a fixed S:

    \sum_{i=1}^{N} \big[ J_d(D(x_i, y_i), 1) + J_d(D(x_i, S(x_i)), 0) \big].

Moreover, given a fixed D, we train the segmentation network by minimizing the following objective with respect to S:

    \sum_{i=1}^{N} \big[ FTL(S(x_i), y_i) + \lambda\, J_d(D(x_i, S(x_i)), 1) \big].

Following the recommendation in [3], we use J_d(D(x_i, S(x_i)), 1) in place of -J_d(D(x_i, S(x_i)), 0), as it leads to stronger gradient signals. After tests on the value of \lambda, we decided to use \lambda = 0.1.

Concerning the training schedule, we found that after pretraining the generator for 50 epochs, we can train the adversarial network for 50 epochs, in which we perform 1 optimization step on the critic network after every 5 optimization steps on the generator. This choice of balance is also borrowed from [13]; however, we note that the training of our network is much faster.
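The alternating optimization can be sketched as two TensorFlow training steps, assuming a segmentation model seg_net, a critic, the focal_tversky_loss above and a tf.data dataset of (image, one-hot mask) batches already exist; the learning rates are illustrative, while lam corresponds to the \lambda = 0.1 and the 5:1 step ratio to the schedule described above.

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()
seg_opt = tf.keras.optimizers.SGD(learning_rate=1e-2)      # illustrative learning rates
critic_opt = tf.keras.optimizers.SGD(learning_rate=1e-2)
lam = 0.1                                                   # lambda from Section 2.3

def generator_step(x, y):
    # minimize FTL(S(x), y) + lambda * J_d(D(x, S(x)), 1)  (non-saturating trick from [3])
    with tf.GradientTape() as tape:
        pred = seg_net(x, training=True)
        d_fake = critic([x, pred], training=False)
        loss = focal_tversky_loss(y, pred) + lam * bce(tf.ones_like(d_fake), d_fake)
    grads = tape.gradient(loss, seg_net.trainable_variables)
    seg_opt.apply_gradients(zip(grads, seg_net.trainable_variables))

def critic_step(x, y):
    # minimize J_d(D(x, y), 1) + J_d(D(x, S(x)), 0) for a fixed S
    with tf.GradientTape() as tape:
        pred = seg_net(x, training=False)
        d_real = critic([x, y], training=True)
        d_fake = critic([x, pred], training=True)
        loss = bce(tf.ones_like(d_real), d_real) + bce(tf.zeros_like(d_fake), d_fake)
    grads = tape.gradient(loss, critic.trainable_variables)
    critic_opt.apply_gradients(zip(grads, critic.trainable_variables))

# adversarial phase: one critic update after every 5 generator updates
for step, (x, y) in enumerate(dataset):
    generator_step(x, y)
    if (step + 1) % 5 == 0:
        critic_step(x, y)
```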
3 DATASETS

For training and validation data, we used the Japanese Society of Radiological Technology (JSRT) dataset [10], as well as the Montgomery and Shenzhen datasets [6], all of which are public datasets of chest X-rays with organ segmentation masks reviewed by expert radiologists.

The JSRT dataset contains a total of 247 images, of which 154 contain lung nodules. The X-rays are all of 2048 × 2048 resolution and have 12-bit grayscale levels. Both lung and heart segmentation masks are available for this dataset.

The Montgomery dataset contains 138 chest X-rays, of which 80 are from healthy patients and 58 are from patients with tuberculosis. The X-rays have a resolution of either 4020 × 4892 or 4892 × 4020, and have 12-bit grayscale levels as well. In the case of this dataset, only lung segmentation masks are publicly available.

The Shenzhen dataset contains a total of 662 chest X-rays, of which 326 are of healthy patients and, in a similar fashion, 336 are of patients with tuberculosis. The images vary in size, but all are of high resolution, with 8-bit grayscale levels. Only lung segmentation masks are publicly available for this dataset.

3.1 Preprocessing Data

X-rays are grayscale images with typically low contrast, which makes their analysis a difficult task. This obstacle might be overcome by using some sort of histogram equalization technique. The idea of standard histogram equalization is to spread out the most frequent intensity values over a wider range of the intensity domain [0, 255] by modifying the intensities so that their cumulative distribution function (CDF) on the complete modified image is as close to the CDF of the uniform distribution as possible. Improvements might be made by using adaptive histogram equalization, in which the above method is not applied globally, but separately on pieces of the image, in order to enhance local contrast. However, this technique might overamplify noise in near-constant regions, hence our choice was to use Contrast Limited Adaptive Histogram Equalization (CLAHE), which counteracts this effect by clipping the histogram at a predefined value before calculating the CDF, and redistributing the clipped part equally among all the histogram bins.

Applying CLAHE to an X-ray image has visually appealing results, as displayed in Figure 3. As our experiments showed, it does not merely help human vision, but also neural networks.

Figure 3. Example of chest X-ray images before and after CLAHE

The images were then resized to 512 × 512 resolution and their intensities mapped to [−1, 1] before being fed to our network.
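The preprocessing pipeline of this section fits into a few lines of OpenCV; the clip limit and tile grid size below are common illustrative defaults rather than values reported here, and the function name is our own.

```python
import cv2
import numpy as np

def preprocess(path, size=512, clip_limit=2.0, tile_grid=(8, 8)):
    """CLAHE, resizing and intensity mapping to [-1, 1], as in Section 3.1."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)             # load as 8-bit grayscale
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile_grid)
    img = clahe.apply(img)                                    # contrast-limited adaptive equalization
    img = cv2.resize(img, (size, size), interpolation=cv2.INTER_AREA)
    img = img.astype(np.float32) / 127.5 - 1.0                # map [0, 255] to [-1, 1]
    return img[..., np.newaxis]                               # add the channel axis the network expects
```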
4 EXPERIMENTS AND RESULTS

The aforementioned Attention U-Net architecture was implemented using the Keras-TensorFlow Python libraries, to which we fed our dataset and trained for 40 epochs with 8 X-ray scans in each batch. Our optimizer of choice was stochastic gradient descent, having found that Adam failed to converge in many cases. As loss function, we applied the Focal Tversky Loss.

We found that applying various data augmentation techniques, such as flipping, rotating and shearing the image, as well as increasing or decreasing its brightness, were of no help and just resulted in slower convergence.

Using the Attention U-Net infrastructure, we managed to reach a dice score of 0.9628 for the lungs. Unlike in [13], where no major preprocessing was done, with our preprocessing method the network performed very well even if the test and the validation sets were from different datasets. This is extremely important for real-world applications, as X-ray images from different machines are significantly different, largely depending on the specific calibration of each machine; thus it is no trivial task to have X-rays accurately evaluated that come from machines from which no images were in the training set.

Figure 4. Epoch-wise dice score coefficient

Table 1. Dice scores of different architectures over different datasets.

Dataset       SCAN          ATTN U-Net     Ours (Adv. ATTN)
JSRT          97.3 ±0.8%    96.3 ±0.7%     97.6 ±0.5%
All           -             95.8 ±0.4%     96.2 ±0.4%
All / JSRT    -             96.6 ±0.6%     97.8 ±0.6%

We note that even though introducing the adversarial scheme in our setting increased the dice scores, the improvement was not as drastic as in the case of the FCN and SCAN. By checking the masks generated by the vanilla Attention U-Net, we found that this phenomenon can be attributed to the fact that while the FCN occasionally produces abnormally shaped masks, due to our preprocessing steps the Attention U-Net does not commit this mistake. Consequently, the adversarial scheme is responsible for subtle shape improvements only, which the Dice Score reflects less spectacularly.

5 FUTURE WORK

So far we have not experimented with the architecture of the critic network, as we found the performance of the architecture in [13] completely satisfying. However, it would be desirable to carry out further tests in this direction in order to achieve a better understanding of the role of the adversarial scheme.

REFERENCES

[1] Nabila Abraham and Naimul Mefraz Khan, 'A novel focal Tversky loss function with improved attention U-Net for lesion segmentation', in 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), pp. 683–687. IEEE, (2019).
[2] NHS England and NHS Improvement, 'Diagnostic imaging dataset statistical release', (2019).
[3] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, 'Generative adversarial nets', in Advances in Neural Information Processing Systems, pp. 2672–2680, (2014).
[4] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick, 'Mask R-CNN', in Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969, (2017).
[5] Qinhua Hu, Luís Fabrício de F. Souza, Gabriel Bandeira Holanda, Shara S. A. Alves, Francisco Hércules dos S. Silva, Tao Han, and Pedro P. Rebouças Filho, 'An effective approach for CT lung segmentation using mask region-based convolutional neural networks', Artificial Intelligence in Medicine, 101792, (2020).
[6] Stefan Jaeger, Sema Candemir, Sameer Antani, Yì-Xiáng J. Wáng, Pu-Xuan Lu, and George Thoma, 'Two public chest X-ray datasets for computer-aided screening of pulmonary diseases', Quantitative Imaging in Medicine and Surgery, 4(6), 475, (2014).
[7] Jonathan Long, Evan Shelhamer, and Trevor Darrell, 'Fully convolutional networks for semantic segmentation', in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440, (2015).
[8] Pauline Luc, Camille Couprie, Soumith Chintala, and Jakob Verbeek, 'Semantic segmentation using adversarial networks', arXiv preprint arXiv:1611.08408, (2016).
[9] Ozan Oktay, Jo Schlemper, Loic Le Folgoc, Matthew Lee, Mattias Heinrich, Kazunari Misawa, Kensaku Mori, Steven McDonagh, Nils Y. Hammerla, Bernhard Kainz, et al., 'Attention U-Net: Learning where to look for the pancreas', arXiv preprint arXiv:1804.03999, (2018).
[10] Junji Shiraishi, Shigehiko Katsuragawa, Junpei Ikezoe, Tsuneo Matsumoto, Takeshi Kobayashi, Ken-ichi Komatsu, Mitate Matsui, Hiroshi Fujita, Yoshie Kodera, and Kunio Doi, 'Development of a digital image database for chest radiographs with and without a lung nodule: receiver operating characteristic analysis of radiologists' detection of pulmonary nodules', American Journal of Roentgenology, 174(1), 71–74, (2000).
[11] Amos Tversky, 'Features of similarity', Psychological Review, 84(4), 327, (1977).
[12] Jie Wang, Zhigang Li, Rui Jiang, and Zhen Xie, 'Instance segmentation of anatomical structures in chest radiographs', in 2019 IEEE 32nd International Symposium on Computer-Based Medical Systems (CBMS), pp. 441–446. IEEE, (2019).
[13] Wei Dai, Nanqing Dong, Zeya Wang, Xiaodan Liang, Hao Zhang, and Eric P. Xing, 'SCAN: Structure correcting adversarial network for organ segmentation in chest X-rays', in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 20, 2018, Proceedings, volume 11045, p. 263. Springer, (2018).