<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Attention U-Net Based Adversarial Architectures for Chest X-ray Lung Segmentation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gusztáv Gaál</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Balázs Maga</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>X-ray is by far the most common among medical imaging modalities, being faster, more accessible, and more cost-effective compared to other radiographic methods. Chest X-ray (CXR) is the most commonly requested test due to its contribution to the early detection of lung cancer. The most important biomarkers in detecting lung cancer are nodules, and in finding those, lung segmentation of chest X-rays is essential. Another goal is interpretability, helping radiologists integrate computer-aided detection methods into their diagnostic pipeline and greatly reducing their workload. For this reason, a robust algorithm to perform this otherwise arduous segmentation task is much desired in the field of medical imaging. In this work, we present a novel deep learning approach that uses state-of-the-art fully convolutional neural networks in conjunction with an adversarial critic model. Our network generalized well to CXR images of unseen datasets with different patient profiles, achieving a final DSC of 97.5% on the JSRT CXR dataset.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        X-ray is the most commonly performed radiographic examination,
being significantly easier to access, cheaper and faster to carry out
than computed tomography (CT), diagnostic ultrasound and
magnetic resonance imaging (MRI), as well as delivering a lower dose of
radiation than a CT scan. According to the publicly
available, official data of the National Health Service ([
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]), in the period
from February 2017 to February 2018, the count of imaging activity
was about 41 million in England, out of which almost 22 million were
plain X-rays. Many of these imaging tests might contribute to early
diagnosis of cancer, amongst which chest X-ray is the most commonly
requested one by general practitioners. In order to identify lung
nodules, lung segmentation of chest X-rays is essential, and this step
is vital in other diagnostic pipelines as well, such as calculating the
cardiothoracic ratio, which is the primary indicator of cardiomegaly.
For this reason, a robust algorithm to perform this otherwise arduous
segmentation task is much desired in the field of medical imaging.
      </p>
      <p>
        Semantic segmentation aims to solve the challenging problem of
assigning a pre-defined class to each pixel of the image. This task
requires a high level of visual understanding, in which
state-of-the-art performance is attained by methods utilizing Fully Convolutional
Networks (FCN) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], adversarial training is used to enhance
segmentation of colored images. This idea was incorporated into [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]
in order to segment chest X-rays with a fully convolutional,
residual neural network. Recently, Mask R-CNN [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] has been utilized to perform
instance segmentation on chest X-rays and obtained state-of-the-art
results [
        <xref ref-type="bibr" rid="ref12 ref5">12, 5</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>DEEP LEARNING APPROACH</title>
    </sec>
    <sec id="sec-3">
      <title>Network Architecture</title>
      <p>Our goal is to produce accurate organ segmentation masks on chest
X-rays, meaning that for each input image we want dense pixel-wise
predictions of whether a given pixel is part of the left lung, the
right lung, the heart, or none of the above.</p>
      <p>
        For this purpose Fully Convolutional Networks (FCNs) are known to
significantly outperform other widely used registration-based
methods. Specifically, we applied a U-Net architecture, thus enabling us
to efficiently compute the segmentation mask at the same resolution
as the input images. The fully convolutional architecture also enables
the use of images of different resolutions, since unlike standard
convolutional networks, FCNs don’t contain input-size dependent layers.
In [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] it has been shown that for medical image analysis tasks the
integration of the proposed Attention Gates (AGs) improved the
accuracy of the segmentation models, while preserving computational
efficiency. The architecture of the proposed Attention U-Net is
described by Figure 1. Without the use of AGs, it’s common practice
to use cascade CNNs, selecting a Region Of Interest (ROI) with
another CNN where the target organ is likely contained. With the use of
AGs we eliminate the need for such a preselecting network; instead,
the Attention U-Net learns to focus on the most important local features
and dulls down the less relevant ones. We note that the dulling of less
relevant local features also results in decreased false positive rates.
In order to enhance the performance of Attention U-Net, we
further experimented with adversarial techniques, motivated by [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. In
that work, the authors first designed a Fully Convolutional Network
(FCN) for the lung segmentation task, and noted that in certain cases
the network tends to segment abnormal and incorrect organ shapes.
For example, the apex of the ribcage might be mistaken for an
internal rib bone, resulting in the mask “bleeding out” to the background,
which has similar intensity as the lung field. To address this issue,
they developed an adversarial scheme, leading to a model which they
call Structure Correcting Adversarial Network (SCAN). This
architecture is based on the idea of Generative Adversarial Networks [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
They use the pretrained Fully Convolutional Network as the generator
of a Generative Adversarial Network, and they also train a critic
network which is fed the ground truth mask, the predicted mask and
optionally the original image. The critic network has roughly the same
architecture, resulting in similar capacity. This approach forces the
generator to segment more realistic masks, eventually removing
obviously wrong shapes.
      </p>
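      <p>To make the role of the AGs concrete, the following is a minimal sketch of an additive attention gate in the spirit of [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], written with the Keras-TensorFlow libraries used elsewhere in this work; the function name, the number of intermediate filters, and the assumption that the skip connection has exactly twice the spatial resolution of the gating signal are illustrative choices, not a description of our exact implementation.</p>
      <preformat>
# Minimal sketch of an additive attention gate (in the spirit of [9]).
# Assumes Keras/TensorFlow; names and filter counts are illustrative.
from tensorflow.keras import layers

def attention_gate(x, g, inter_channels):
    """x: skip-connection features, g: coarser gating signal (half the spatial size of x)."""
    theta_x = layers.Conv2D(inter_channels, 1, strides=2, padding="same")(x)   # bring x to g's resolution
    phi_g = layers.Conv2D(inter_channels, 1, padding="same")(g)
    f = layers.Activation("relu")(layers.add([theta_x, phi_g]))                # additive attention
    psi = layers.Conv2D(1, 1, padding="same", activation="sigmoid")(f)         # attention coefficients
    alpha = layers.UpSampling2D(size=(2, 2), interpolation="bilinear")(psi)    # back to x's resolution
    return layers.multiply([x, alpha])                                         # gate the skip connection
      </preformat>
      <p>The gated skip features are then concatenated with the upsampled decoder features exactly as in a standard U-Net skip connection.</p>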
      <p>
        In our work, besides the standard Attention U-Net, we also created
a network of analogous structure, in which the FCN used in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] is
replaced by the Attention U-Net. We did not introduce any
modification in the critic model design; such experiments are left to future
work.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Tversky Loss</title>
      <p>In the field of medical imaging, the Dice Score Coefficient (DSC) is
probably the most widespread and simple way to measure the overlap
ratio of the masks and the ground truth, and hence to compare and
evaluate segmentations. Given two sets of pixels $X, Y$, their DSC is
$$\mathrm{DSC}(X, Y) = \frac{2|X \cap Y|}{|X| + |Y|}.$$
If $Y$ is in fact the result of a test about which pixels are in $X$, we can
rewrite it with the usual notation of true/false positive (TP/FP) and false
negative (FN) as
$$\mathrm{DSC}(X, Y) = \frac{2TP}{2TP + FN + FP}.$$
We would like to use this concept in our setup. The class $c$ we would
like to segment corresponds to a set, but it is more appropriate to
consider its indicator function $g$, that is, $g_{i,c} \in \{0, 1\}$ equals 1 if and
only if the $i$th pixel belongs to the object. On the other hand, our
prediction is a probability for each pixel, denoted by $p_{i,c} \in [0, 1]$. Then
the Dice Score of the prediction, in the spirit of the above description, is
$$\mathrm{DSC} = \frac{\sum_{i=1}^{N} p_{i,c} g_{i,c} + \varepsilon}{\sum_{i=1}^{N} (p_{i,c} + g_{i,c}) + \varepsilon},$$
where $N$ is the total number of pixels, and $\varepsilon$ is introduced for the
sake of numerical stability and to avoid division by 0. The linear Dice
Loss (DL) of the multiclass prediction is then
$$DL = \sum_{c} (1 - \mathrm{DSC}_c).$$
A deficiency of Dice Loss is that it penalizes false negative and
false positive predictions equally, which results in high precision but
low recall. For example, practice shows that if the regions of interest
(ROI) are small, false negative pixels need to have a higher weight
than false positive ones. Mathematically this obstacle is easily
overcome by introducing weights $\alpha, \beta$ as tuneable parameters, resulting
in the definition of the Tversky similarity index [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]:
$$TI_c = \frac{\sum_{i=1}^{N} p_{i,c} g_{i,c} + \varepsilon}{\sum_{i=1}^{N} p_{i,c} g_{i,c} + \alpha \sum_{i=1}^{N} p_{i,c} \bar{g}_{i,c} + \beta \sum_{i=1}^{N} \bar{p}_{i,c} g_{i,c} + \varepsilon},$$
where $\bar{p}_{i,c} = 1 - p_{i,c}$ and $\bar{g}_{i,c} = 1 - g_{i,c}$, that is, the overline simply
stands for the complement of the class.</p>
      <p>Tversky Loss is obtained from the Tversky index as Dice Loss was
obtained from the Dice Score Coefficient:
$$TL = \sum_{c} (1 - TI_c).$$
Another issue with the Dice Loss is that it struggles to segment small
ROIs, as they do not contribute to the loss significantly. This difficulty
was addressed in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], where the authors introduced the Focal
Tversky Loss in order to improve the performance of their lesion
segmentation model:
$$FTL = \sum_{c} (1 - TI_c)^{1/\gamma},$$
where $\gamma \in [1, 3]$. In practice, if a pixel is misclassified with a
high Tversky index, the Focal Tversky Loss is unaffected. However,
if the Tversky index is small and the pixel is misclassified, the Focal
Tversky Loss will decrease significantly.</p>
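      <p>As an illustration, the Tversky index and the Focal Tversky Loss above translate into a few lines of Keras-TensorFlow code; the sketch below is an assumption-laden example rather than our exact implementation, and the parameter values for α, β, γ are placeholders chosen so that false negatives are weighted more heavily, in line with the discussion above.</p>
      <preformat>
# Sketch of the Focal Tversky Loss in Keras/TensorFlow.
# alpha, beta, gamma are illustrative values, not our tuned settings;
# beta larger than alpha weights false negatives more heavily than false positives.
import tensorflow.keras.backend as K

def focal_tversky_loss(y_true, y_pred, alpha=0.3, beta=0.7, gamma=4.0 / 3.0, eps=1e-7):
    num_classes = K.int_shape(y_pred)[-1]
    p = K.reshape(y_pred, (-1, num_classes))                 # p_{i,c}
    g = K.reshape(y_true, (-1, num_classes))                 # g_{i,c}
    tp = K.sum(p * g, axis=0)                                # sum_i p_{i,c} g_{i,c}
    fp = K.sum(p * (1.0 - g), axis=0)                        # sum_i p_{i,c} (1 - g_{i,c})
    fn = K.sum((1.0 - p) * g, axis=0)                        # sum_i (1 - p_{i,c}) g_{i,c}
    ti = (tp + eps) / (tp + alpha * fp + beta * fn + eps)    # Tversky index per class
    return K.sum(K.pow(1.0 - ti, 1.0 / gamma))               # FTL = sum_c (1 - TI_c)^(1/gamma)
      </preformat>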
    </sec>
    <sec id="sec-5">
      <title>Training</title>
      <p>
        The training of our structure correcting network requires a somewhat
longer explanation; we directly follow the footsteps of [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
Let S, D be the segmentation network and the critic network,
respectively. The data consist of the input images $x_i$ and the
associated mask labels $y_i$, where $x_i$ is of shape $[H, W, 1]$ for a
single-channel gray-scale image with height $H$ and width $W$, and $y_i$ is
of shape $[H, W, C]$, where $C$ is the number of classes including the
background. Note that for each pixel location $(j, k)$, $y_{ijkc} = 1$ for
the labeled class channel $c$, while the rest of the channels are zero
($y_{ijkc'} = 0$ for $c' \neq c$). We use $S(x) \in [0, 1]^{[H, W, C]}$ to denote the
class probabilities predicted by S at each pixel location such that the
class probabilities normalize to 1 at each pixel. Let $D(x_i, y)$ be the
scalar probability estimate of $y$ coming from the training data. They
defined the optimization problem as
$$\min_S \max_D \; J(S, D) := \sum_{i=1}^{N} \Big\{ J_s(S(x_i), y_i) - \lambda \big[ J_d(D(x_i, y_i), 1) + J_d(D(x_i, S(x_i)), 0) \big] \Big\}, \qquad (1)$$
where
$$J_s(\hat{y}, y) := -\frac{1}{HW} \sum_{j,k} \sum_{c=1}^{C} y_{jkc} \ln \hat{y}_{jkc}$$
is the multiclass cross-entropy loss for predicted mask $\hat{y}$ averaged
over all pixels, and
$$J_d(\hat{t}, t) := -\big[ t \ln \hat{t} + (1 - t) \ln(1 - \hat{t}) \big]$$
is the binary logistic loss for the critic’s prediction. Here $\lambda$ is a tuning
parameter balancing the pixel-wise loss and the adversarial loss. We can
solve equation (1) by alternating between optimizing S and
optimizing D using their respective loss functions. This is a point where
we introduced a modification: instead of using the multiclass
cross-entropy loss $J_s(\hat{y}, y)$ in the first term, we applied the Focal Tversky
Loss $FTL(\hat{y}, y)$.
      </p>
      <p>Now, since the first term in equation (1) does not depend on D, we
can train our critic network by minimizing the following objective
with respect to D for a fixed S:
$$\sum_{i=1}^{N} J_d(D(x_i, y_i), 1) + J_d(D(x_i, S(x_i)), 0).$$
Moreover, given a fixed D, we train the segmentation network by
minimizing the following objective with respect to S:
$$\sum_{i=1}^{N} FTL(S(x_i), y_i) + \lambda \, J_d(D(x_i, S(x_i)), 1).$$
Following the recommendation in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], we use $J_d(D(x_i, S(x_i)), 1)$
in place of $-J_d(D(x_i, S(x_i)), 0)$, as it leads to stronger gradient
signals. After tests on the value of $\lambda$, we decided to use $\lambda = 0.1$.
      </p>
      <p>
        Concerning the training schedule, we found that after pretraining
the generator for 50 epochs, we can train the adversarial network for
50 epochs, in which we perform one optimization step on the critic
network after every 5 optimization steps on the generator. This choice
of balance is also borrowed from [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]; however, we note that the
training of our network is much faster.
      </p>
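      <p>For concreteness, the alternating scheme with the 5:1 generator/critic step ratio and λ = 0.1 can be sketched as below. This is a schematic TensorFlow 2 version under assumed model interfaces (a segmentation model S and a critic D taking the image and a mask as inputs), not a verbatim excerpt of our code, and it reuses the focal_tversky_loss sketch given above.</p>
      <preformat>
# Schematic alternating adversarial training step (after [13]).
# Assumes TensorFlow 2.x eager execution; S, D, the optimizers and the
# two-input critic interface are illustrative assumptions.
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()   # binary logistic loss J_d
lam = 0.1                                    # lambda, weight of the adversarial term

def train_step(S, D, opt_S, opt_D, x, y, step):
    # Segmentation update: minimize FTL(S(x), y) + lambda * J_d(D(x, S(x)), 1)
    with tf.GradientTape() as tape:
        pred = S(x, training=True)
        d_fake = D([x, pred], training=False)
        loss_S = focal_tversky_loss(y, pred) + lam * bce(tf.ones_like(d_fake), d_fake)
    opt_S.apply_gradients(zip(tape.gradient(loss_S, S.trainable_variables),
                              S.trainable_variables))

    # Critic update once per 5 generator steps:
    # minimize J_d(D(x, y), 1) + J_d(D(x, S(x)), 0) for fixed S
    if step % 5 == 0:
        with tf.GradientTape() as tape:
            d_real = D([x, y], training=True)
            d_fake = D([x, S(x, training=False)], training=True)
            loss_D = (bce(tf.ones_like(d_real), d_real) +
                      bce(tf.zeros_like(d_fake), d_fake))
        opt_D.apply_gradients(zip(tape.gradient(loss_D, D.trainable_variables),
                                  D.trainable_variables))
      </preformat>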
    </sec>
    <sec id="sec-6">
      <title>DATASETS</title>
      <p>
        For training and validation data, we used the Japanese Society
of Radiological Technology (JSRT) dataset [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], as well as the
Montgomery and Shenzhen datasets [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], all of which are public
datasets of chest X-rays with available organ segmentation masks
reviewed by expert radiologists.
      </p>
      <p>The JSRT dataset contains a total of 247 images, of which 154
contain lung nodules. The X-rays are all of 2048 × 2048 resolution
and have 12-bit grayscale levels. Both lung and heart segmentation
masks are available for this dataset.</p>
      <p>The Montgomery dataset contains 138 chest X-rays, of which 80
X-rays are from healthy patients, and 58 are from patients with
tuberculosis. The X-rays have a resolution of either 4020 × 4892 or
4892 × 4020, and have 12-bit grayscale levels as well. In the case of
this dataset, only lung segmentation masks are publicly available.</p>
      <p>The Shenzhen dataset contains a total of 662 chest X-rays, of which
326 are of healthy patients, and in a similar fashion, 336 are of
patients with tuberculosis. The images vary in size, but all are of
high resolution, with 8-bit grayscale levels. Only lung segmentation
masks are publicly available for this dataset.</p>
    </sec>
    <sec id="sec-7">
      <title>Preprocessing Data</title>
      <p>X-rays are grayscale images with typically low contrast, which
makes their analysis a difficult task. This obstacle might be
overcome by using some sort of histogram equalization technique. The
idea of standard histogram equalization is spreading out the most
frequent intensity values to a higher range of the intensity domain
[0, 255] by modifying the intensities so that their cumulative
distribution function (CDF) on the complete modified image is as close
to the CDF of the uniform distribution as possible. Improvements
might be made by using adaptive histogram equalization, in which
the above method is not utilized globally, but separately on pieces of
the image, in order to enhance local contrasts. However, this
technique might overamplify noise in near-constant regions, hence our
choice was to use Contrast Limited Adaptive Histogram Equalization
(CLAHE), which counteracts this effect by clipping the histogram at
a predefined value before calculating the CDF, and redistributing this
clipped part of the histogram equally among all the histogram bins.
Applying CLAHE to an X-ray image has visually appealing results,
as displayed in Figure 3. As our experiments showed, it does not
merely help human vision, but also neural networks.
The images were then resized to 512 × 512 resolution and mapped
to [−1, 1] before being fed to our network.</p>
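      <p>A minimal version of this preprocessing pipeline could look as follows in Python with OpenCV; the clip limit and tile grid size are assumed illustrative values rather than our exact settings, and the input is assumed to be an 8-bit grayscale image.</p>
      <preformat>
# Sketch of the preprocessing: CLAHE, resize to 512 x 512, rescale to [-1, 1].
# clipLimit and tileGridSize are illustrative assumptions.
import cv2
import numpy as np

def preprocess(img_u8):
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    eq = clahe.apply(img_u8)                                  # contrast-limited adaptive equalization
    eq = cv2.resize(eq, (512, 512), interpolation=cv2.INTER_AREA)
    x = eq.astype(np.float32) / 127.5 - 1.0                   # map [0, 255] to [-1, 1]
    return x[..., np.newaxis]                                 # add a channel dimension
      </preformat>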
    </sec>
    <sec id="sec-8">
      <title>EXPERIMENTS AND RESULTS</title>
      <p>The aforementioned Attention U-Net architecture was implemented
using the Keras-TensorFlow Python libraries; we fed it our
dataset and trained for 40 epochs with 8 X-ray scans in each batch.
Our optimizer of choice was stochastic gradient descent, having
found that Adam failed to converge in many cases. As the loss function,
we applied the Focal Tversky Loss.</p>
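      <p>In code, this training configuration amounts to something along the following lines, assuming a Keras model attention_unet and the focal_tversky_loss sketch from above; the SGD learning rate and momentum are illustrative assumptions rather than our exact values.</p>
      <preformat>
# Sketch of the training configuration described above.
# attention_unet, train_images, train_masks, val_images, val_masks are
# assumed to be defined elsewhere; SGD hyperparameters are illustrative.
from tensorflow.keras.optimizers import SGD

attention_unet.compile(optimizer=SGD(learning_rate=0.01, momentum=0.9),
                       loss=focal_tversky_loss)
attention_unet.fit(train_images, train_masks,
                   batch_size=8, epochs=40,
                   validation_data=(val_images, val_masks))
      </preformat>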
      <p>We found that applying various data augmentation
techniques, such as flipping, rotating, and shearing the image, as well as
increasing or decreasing the brightness of the image, was of no help
and just resulted in slower convergence.</p>
      <p>
        Using the Attention U-Net infrastructure, we managed to reach
a dice score of 0.9628 for the lungs. Unlike in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], where no
major preprocessing was done, with our preprocessing method the
network performed very well even if the test and validation
sets were from different datasets. This is extremely important for
real-world applications, as X-ray images from different machines are
significantly different, largely dependent on the specific calibration
of each machine; thus it is no trivial task to accurately evaluate
X-rays from machines from which no images were in the
training set.</p>
      <p>[Table: Dice scores of the plain ATTN U-Net and of our adversarial variant (Ours, Adv. ATTN); the recoverable entry reads 97.3 ± 0.8%.]</p>
      <p>We note that even though introducing the adversarial scheme in
our setting increased the dice scores, the improvement was not as
drastic as in the case of the FCN and SCAN. By checking the masks
generated by the vanilla Attention U-Net, we found that this
phenomenon can be attributed to the fact that while the FCN
occasionally produces abnormally shaped masks, thanks to our preprocessing
steps the Attention U-Net does not commit this mistake.
Consequently, the adversarial scheme is responsible only for subtle shape
improvements, which is reflected less spectacularly in the Dice Score.</p>
    </sec>
    <sec id="sec-9">
      <title>FUTURE WORK</title>
      <p>
        So far we have not experimented with the architecture of the critic
network, as we found the performance of the architecture in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]
completely satisfactory. However, it would be desirable to carry out further
tests in this direction in order to achieve a better understanding of the
role of the adversarial scheme.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Nabila</given-names>
            <surname>Abraham</surname>
          </string-name>
          and Naimul Mefraz Khan,
          <article-title>'A novel focal Tversky loss function with improved attention U-Net for lesion segmentation'</article-title>
          ,
          <source>in 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI</source>
          <year>2019</year>
          ), pp.
          <fpage>683</fpage>
          -
          <lpage>687</lpage>
          . IEEE, (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>NHS England and NHS Improvement</string-name>
          ,
          <article-title>'Diagnostic imaging dataset statistical release'</article-title>
          , (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Ian</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          , Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, '
          <article-title>Generative adversarial nets'</article-title>
          ,
          <source>in Advances in neural information processing systems</source>
          , pp.
          <fpage>2672</fpage>
          -
          <lpage>2680</lpage>
          , (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Kaiming</given-names>
            <surname>He</surname>
          </string-name>
          , Georgia Gkioxari, Piotr Dollár, and Ross Girshick,
          <article-title>'Mask R-CNN'</article-title>
          ,
          <source>in Proceedings of the IEEE international conference on computer vision</source>
          , pp.
          <fpage>2961</fpage>
          -
          <lpage>2969</lpage>
          , (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Qinhua</given-names>
            <surname>Hu</surname>
          </string-name>
          , Luís Fabrício de F Souza, Gabriel Bandeira Holanda, Shara SA Alves, Francisco Hércules dos S Silva, Tao Han, and Pedro P Rebouças Filho,
          <article-title>'An effective approach for CT lung segmentation using mask region-based convolutional neural networks'</article-title>
          ,
          <source>Artificial Intelligence in Medicine</source>
          ,
          <volume>101792</volume>
          , (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Stefan</given-names>
            <surname>Jaeger</surname>
          </string-name>
          , Sema Candemir, Sameer Antani, Yì-Xiáng J Wáng, Pu-Xuan Lu, and George Thoma,
          <article-title>'Two public chest X-ray datasets for computer-aided screening of pulmonary diseases'</article-title>
          ,
          <source>Quantitative imaging in medicine and surgery</source>
          ,
          <volume>4</volume>
          (
          <issue>6</issue>
          ),
          <fpage>475</fpage>
          , (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Jonathan</given-names>
            <surname>Long</surname>
          </string-name>
          , Evan Shelhamer, and Trevor Darrell, '
          <article-title>Fully convolutional networks for semantic segmentation'</article-title>
          ,
          <source>in Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          , pp.
          <fpage>3431</fpage>
          -
          <lpage>3440</lpage>
          , (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Pauline</given-names>
            <surname>Luc</surname>
          </string-name>
          , Camille Couprie, Soumith Chintala, and Jakob Verbeek, '
          <article-title>Semantic segmentation using adversarial networks'</article-title>
          ,
          <source>arXiv preprint arXiv:1611.08408</source>
          , (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Ozan</given-names>
            <surname>Oktay</surname>
          </string-name>
          , Jo Schlemper, Loic Le Folgoc,
          <string-name>
            <given-names>Matthew</given-names>
            <surname>Lee</surname>
          </string-name>
          , Mattias Heinrich, Kazunari Misawa, Kensaku Mori,
          <string-name>
            <given-names>Steven</given-names>
            <surname>McDonagh</surname>
          </string-name>
          , Nils Y Hammerla,
          <string-name>
            <given-names>Bernhard</given-names>
            <surname>Kainz</surname>
          </string-name>
          , et al.,
          <article-title>'Attention U-Net: Learning where to look for the pancreas'</article-title>
          ,
          <source>arXiv preprint arXiv:1804.03999</source>
          , (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Junji</given-names>
            <surname>Shiraishi</surname>
          </string-name>
          , Shigehiko Katsuragawa, Junpei Ikezoe, Tsuneo Matsumoto, Takeshi Kobayashi, Ken-ichi
          <string-name>
            <surname>Komatsu</surname>
          </string-name>
          , Mitate Matsui, Hiroshi Fujita, Yoshie Kodera, and Kunio Doi, '
          <article-title>Development of a digital image database for chest radiographs with and without a lung nodule: receiver operating characteristic analysis of radiologists' detection of pulmonary nodules'</article-title>
          ,
          <source>American Journal of Roentgenology</source>
          ,
          <volume>174</volume>
          (
          <issue>1</issue>
          ),
          <fpage>71</fpage>
          -
          <lpage>74</lpage>
          , (
          <year>2000</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Amos</given-names>
            <surname>Tversky</surname>
          </string-name>
          , 'Features of similarity.', Psychological review,
          <volume>84</volume>
          (
          <issue>4</issue>
          ),
          <fpage>327</fpage>
          , (
          <year>1977</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Jie</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Zhigang</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Rui</given-names>
            <surname>Jiang</surname>
          </string-name>
          , and Zhen Xie,
          <article-title>'Instance segmentation of anatomical structures in chest radiographs'</article-title>
          ,
          <source>in 2019 IEEE 32nd International Symposium on Computer-Based Medical Systems (CBMS)</source>
          , pp.
          <fpage>441</fpage>
          -
          <lpage>446</lpage>
          . IEEE, (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Wei</given-names>
            <surname>Dai</surname>
          </string-name>
          , Nanqing Dong, Zeya Wang, Xiaodan Liang, Hao Zhang, and Eric P Xing,
          <article-title>'SCAN: Structure correcting adversarial network for organ segmentation in chest X-rays'</article-title>
          ,
          <source>in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop</source>
          , DLMIA 2018,
          <article-title>and</article-title>
          8th International Workshop, ML-CDS
          <year>2018</year>
          ,
          <article-title>Held in Conjunction with MICCAI 2018, Granada</article-title>
          , Spain,
          <year>September 20</year>
          ,
          <year>2018</year>
          , Proceedings, volume
          <volume>11045</volume>
          , p.
          <fpage>263</fpage>
          . Springer, (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>