=Paper= {{Paper |id=Vol-2551/paper-05 |storemode=property |title=Lung Nodule Classification Using Convolutional Autoencoder and Clustering Augmented Learning Method(CALM) |pdfUrl=https://ceur-ws.org/Vol-2551/paper-05.pdf |volume=Vol-2551 |authors=Soumya Suvra Ghosal,Indranil Sarkar,Issmail El Hallaoui |dblpUrl=https://dblp.org/rec/conf/wsdm/GhosalSH20 }} ==Lung Nodule Classification Using Convolutional Autoencoder and Clustering Augmented Learning Method(CALM)== https://ceur-ws.org/Vol-2551/paper-05.pdf
 Lung nodule classification using Convolutional Autoencoder
    and Clustering Augmented Learning Method(CALM)
            Soumya Suvra Ghosal                                            Indranil Sarkar                            Issmail El Hallaoui
        soumyasuvraghosal@gmail.com                             indranil.sarkar.nitdgp@gmail.com                  issmail.elhallaoui@gerad.ca
               NIT Durgapur                                                NIT Durgapur                         Ecole Polytechnique de Montreal
               Durgapur, India                                            Durgapur, India                              Montreal, Canada
ABSTRACT
Early detection of lung cancer can help sharply decrease the mortality rate of lung cancer, which
accounts for more than 17 percent of total cancer-related deaths. A large number of cases are
encountered by radiologists daily for initial diagnosis. Computer-Aided Diagnosis (CAD) systems can
assist radiologists by offering a second opinion and making the whole process faster. However, one
drawback of CAD systems is the large amount of data needed to train them, which can be expensive in
the medical field.
In this paper, we propose using a generative adversarial network (GAN) as a potential data
augmentation strategy to generate more training data to improve CAD systems. We also propose a
convolutional autoencoder deep learning framework to support unsupervised image feature learning
for lung nodules through unlabeled data. The paper also introduces the Clustering Augmented Learning
Method (CALM) classifier, which is based on the concept of simultaneous heterogeneous clustering and
classification to learn deep feature representations of the features obtained from the
convolutional autoencoder.
The classification model within CALM consists of a Feedforward Neural Net (FNN) architecture. To
improve the accuracy of the classification model, CALM iterates between clustering and learning to
form robust clusters, thereby leveraging the learning process of the FNN.
Computational experiments using the National Cancer Institute (NCI) Lung Image Database Consortium
(LIDC) dataset resulted in an overall accuracy of 95.3% with a precision of 94.9%.

CCS CONCEPTS
• Computing Methodologies → Machine learning; Feature Selection; • Information systems →
Information systems applications; Data mining; • Applied Computing → Health informatics.

KEYWORDS
Convolutional Autoencoder Neural Network, Lung Nodule, Generative Adversarial Networks, Deep Features

ACKNOWLEDGEMENT
This work was presented at the first Health Search and Data Mining Workshop [5].

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0)

1    INTRODUCTION
The use of computer tools and basic machine learning to facilitate and enhance medical analysis and
diagnosis is a promising area. The study of the correlation between gene expression profiles and
disease states or stages of cells plays an important role in biological and clinical applications.
Gene expression profiles can be obtained from multiple tissue samples by comparing the diseased
tissue with the normal one. One main challenge in this regard is to determine the difference between
cancerous gene expression in tumor cells and the gene expression in normal, non-cancerous tissues.
Many machine learning classification techniques and algorithms have been proposed to address this
problem. Hence intelligent healthcare systems are an important research direction to assist doctors
in harnessing medical big data.
Among all types of cancer, lung cancer is harder to detect in its early stages, as there is only a
dime-sized lesion, known as a nodule, inside the lung. By the time it can be detected, it is already
too late for the patient. Also, these small lesions are only detectable by a CT scan.
It is especially difficult to identify, from a large number of pulmonary CT images, the images
containing nodules, which should be analyzed to assist early lung cancer diagnosis. At present, the
image analysis methods for assisting radiologists in identifying pulmonary nodules consist of four
steps: 1) region of interest (ROI) definition, 2) segmentation, 3) hand-crafted features and
4) categorization. In particular, a radiologist has to spend a lot of time checking each image to
accurately mark the nodule, which is critical for diagnosis and is a research hotspot in intelligent
healthcare.
For example, it has been proposed to extract texture features for nodule analysis, but it is hard to
find effective texture feature parameters. Previously, nodules were analyzed by the morphological
method through shape, size, boundary, etc. However, it is difficult for this analytical approach to
provide accurate descriptive information, because even an experienced radiologist usually gives a
vague description based on personal experience and understanding. Therefore, effectively extracting
features for representing the nodules is a challenging issue.
Recently, CAD systems have taken advantage of the popular Convolutional Neural Network (CNN),
producing state-of-the-art detection results, with 95% sensitivity at only 10 false positives per
scan. However, a CNN requires a large amount of training data to learn effectively; in the medical
field, obtaining the required data is often costly, time-consuming, or simply not feasible. To deal
with these issues, data augmentation is often used to better train these CAD systems.
In [3], the authors addressed the challenges by training a deep learning architecture based on the
Convolutional Autoencoder Neural Network (CANN) for the classification of pulmonary nodules.
Inspired by the results obtained, we also use a similar architecture for extracting deep features
from CT images. Besides, we present a
new way to improve lung nodule detection in existing systems by augmenting training datasets with
generated images of nodules. To create these images, we propose the use of a type of Generative
Adversarial Network (GAN). The augmentation of data would help in more accurate supervised
fine-tuning of the proposed model. Overall, the proposed method utilizes both the original and
generated images for unsupervised feature learning and some amount of data for fine-tuning.
Computational experiments show that the proposed method is effective at extracting image features
via a data-driven approach, and achieves faster labeling for medical data. Specifically, the main
contributions of this paper are:

    • Application of GANs to augment the training data for computer-aided lung nodule detection
      systems and address the issue of the insufficiency of training data.
    • Image features are extracted directly from the raw image. Such an end-to-end approach does
      not use an image segmentation method to find the nodules, avoiding loss of important
      information which might affect classification results.
    • The unsupervised data-driven approach can be extended to other data sets and related
      applications.
    • Devising a classification approach in which data is clustered based on their inherent
      characteristics. In the process of learning the best clustering solution, the parameters of
      the classification model are optimized, thereby substantially improving the learning process.

2   RELATED WORKS
In the past, several methods have been proposed to detect and classify lung cancer in CT images
using different algorithms. Aliferis et al. [2] used recursive feature elimination with single
variable association filtering approaches to select a small subset of the gene expressions as a
reduced feature set. For better classification, Ramaswamy [13] applied recursive feature elimination
using SVM to similarly find a small number of genes. Wang et al. [18] showed that if the
correlation-based feature selector is combined with a classification approach, it can obtain good
classification results with high confidence. Sharma et al. [15] proposed to find an informative
subset of gene expressions using feature selection methods. It is like the "Divide & Conquer"
approach: from each subset they find the informative genes, and then they combine them to form the
overall subset. Nanni et al. [11] proposed a method that combines different feature reduction
approaches, useful for gene microarray classification. In Zinovev et al. [21], the authors used
decision trees to classify lung nodules using the LIDC dataset. The features taken by them are
lobulation, texture, spiculation, etc. Those are used to create a 63-dimensional feature vector for
classification of 914 instances. The authors got an overall accuracy of 68.66%. Kuruvilla et al.
[10] used six distinct parameters including skewness and fifth & sixth central moments, which are
extracted from segmented single slices, containing 2 lung images, along with the features mentioned
in [1], and trained a feed-forward backpropagation neural network. There has also been a renewed
interest in the field of deep learning, and the latest research in the area of medical imaging using
deep learning shows some good results. One such paper is that of Suk et al. [17], in which the
authors propose a novel latent and shared feature representation of neuro-imaging data of the brain
using a Deep Boltzmann Machine (DBM) for diagnosis. The method achieved a maximal diagnostic
accuracy of 95.52%. In Riccardi et al. [14], the authors proposed a new algorithm based on 3D radial
transforms, which can automatically detect nodules with an overall accuracy of 71%. Kumar et al. [9]
proposed to use deep features extracted from an autoencoder along with a binary decision tree as a
classifier to build their proposed system for lung cancer classification. Wu et al. [19] proposed
deep feature learning for deformable registration of brain MR images. They demonstrate that a
general approach can be built to improve image registration by using deep features. Fakoor et al.
[6] proposed a method to enhance cancer diagnosis and classification from gene expression data using
unsupervised and deep learning methods. Their model used PCA (Principal Component Analysis) to
achieve dimensionality reduction, given the very high dimensionality of the initial raw feature
space. Chuquicusma et al. [4] showed that GANs are able to generate realistic fake images that fool
even experienced radiologists.
Maayan et al. [7] used GANs to augment liver lesion images to improve multiclass CNN
classification. They obtained an increase to 85.7% sensitivity and 92.4% specificity, which is much
higher compared to recent state-of-the-art liver classification methods. Zhu et al. [20] showed
that Generative Adversarial Networks (GANs) can be used to complement and complete the training data
manifold, which makes it possible to find better margins between classes. They used GANs to augment
the emotion categories that were lacking in face data and achieved a 5% to 10% increase in the
accuracy of emotion classification.
   In this paper, we propose a convolutional autoencoder unsupervised learning algorithm for lung
CT feature learning and the CALM classifier for pulmonary nodule classification. To tackle the
scarcity of labeled medical images, we use a type of Generative Adversarial Network (GAN) to augment
the training set.

3 PRELIMINARIES

3.1 Generative Adversarial Networks(GAN)
Generative Adversarial Networks (GANs) are a type of neural network where two competing networks,
the generator and the discriminator, are adversarially trained against one another. The
discriminator is trained to differentiate between real data and generated data, while the generator
attempts to fool the discriminator by generating synthetic data. More specifically, the generator G
samples from a previously known data distribution $z \sim P_z(z)$ (usually a Gaussian) and generates
data $G(z)$ by putting $z$ through a function G. The discriminator D takes in data $x$ and produces
a probability that $x$ is a sample from the real data distribution $P_{data}(x)$. The loss function
that the discriminator D maximizes and the generator G minimizes is

$$L = \min_G \max_D \; \mathbb{E}_{x \sim P_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim P_z(z)}[\log(1 - D(G(z)))]$$

While this original GAN is useful for a multitude of tasks, the Jensen-Shannon divergence as loss
function inherently struggles to learn probability distributions between low dimensional manifolds
in a higher-dimensional space. Wasserstein GANs (WGANs) attempt to solve this problem by using an
approximation of the Earth-Mover
distance as the loss function, which enables more stable GAN training. The discriminator is now
replaced with a critic, as its output is no longer a probability; rather, it is a 1-Lipschitz
function that tries to maximize the difference in score between the real data and the generated
data. A differentiable function is 1-Lipschitz if and only if the norm of its gradient is at most 1
everywhere. The authors of the WGAN paper enforce that the critic is 1-Lipschitz by weight-clipping,
which may lead to optimization difficulties. The new loss function is as follows:

$$L = \min_G \max_{D \in \mathcal{D}} \; \mathbb{E}_{x \sim P_{data}(x)}[D(x)] - \mathbb{E}_{z \sim P_z(z)}[D(G(z))]$$

where $\mathcal{D}$ is the set of 1-Lipschitz functions.

3.2    Autoencoder
An autoencoder takes an input $x \in \mathbb{R}^d$ and first maps it to the latent representation
$h \in \mathbb{R}^{d'}$ using a deterministic function of type $h = f_\theta(x) = \sigma(Wx + b)$
with parameters $\theta = \{W, b\}$. This "code" is then used to reconstruct the input by a reverse
mapping of f: $y = f_{\theta'}(h) = \sigma(W'h + b')$ with $\theta' = \{W', b'\}$. The two parameter
sets are usually constrained to be of the form $W' = W^T$, using the same weights for encoding the
input and decoding the latent representation. Each training pattern $x_i$ is then mapped onto its
code $h_i$ and its reconstruction $y_i$. The parameters are optimized by minimizing an appropriate
cost function over the training set $D_n = \{(x_0, t_0), \dots, (x_n, t_n)\}$.

3.3    Denoising Autoencoders(DAE)
Without any additional constraints, conventional autoencoders learn the identity mapping. This
problem can be circumvented by using a probabilistic RBM (Restricted Boltzmann Machine) approach,
sparse coding, or denoising autoencoders trying to reconstruct noisy inputs. The latter perform as
well as or even better than RBMs. Training involves the reconstruction of a clean input from a
partially destroyed one. Input $x$ becomes corrupted input $\tilde{x}$ by adding a variable amount
$v$ of noise distributed according to the characteristics of the input image. Common choices include
binomial noise (switching pixels on or off) for black and white images or uncorrelated Gaussian
noise for color images. Parameter $v$ represents the percentage of permissible corruption. The
autoencoder is trained to denoise the inputs by first finding the latent representation
$h = f_\theta(\tilde{x}) = \sigma(W\tilde{x} + b)$, from which it reconstructs the original input
$y = f_{\theta'}(h) = \sigma(W'h + b')$.

3.4    Convolutional Neural Networks
CNNs are hierarchical models whose convolutional layers alternate with subsampling layers,
reminiscent of simple and complex cells in the primary visual cortex. The network architecture
consists of three basic building blocks to be stacked and composed as needed, i.e., the convolution
layer, the max-pooling layer, and the classification layer.

3.5    Convolutional Auto Encoder(CAE)
A fully connected autoencoder ignores the 2-D image structure. This is not only a problem when
dealing with realistically sized inputs but also introduces redundancy in the parameters, forcing
each feature to be global. However, the trend in vision and object recognition adopted by most
successful models is to discover localized features that repeat themselves all over the input. CAEs
differ from conventional AEs in that their weights are shared among all locations of the input,
preserving spatial locality. The reconstruction is hence due to a linear combination of basic image
patches based on the latent code. The CAE combines the local convolution connection with the
autoencoder, which is a simple operation to add a reconstruction input for the convolution
operation. The procedure of the convolutional conversion from input feature maps to output is called
the convolutional encoder. Then the output values are reconstructed through the inverse
convolutional operation, which is called the convolutional decoder. Moreover, the parameters of the
encode and decode operations are calculated through standard autoencoder unsupervised greedy
training.

Figure 1: Convolutional Autoencoder

 The input feature maps $x \in \mathbb{R}^{n \times l \times l}$ are obtained from the input layer
or the previous layer. They consist of n feature maps, and the size of each feature map is
$l \times l$ pixels. The convolutional autoencoder operation includes m convolutional kernels, and
the output layer outputs m feature maps. When the input feature maps come from the previous layer, n
represents the number of output feature maps of the previous layer. The size of a convolutional
kernel is $d \times d$, where $d \le l$. $\theta = \{W, \hat{W}, b, \hat{b}\}$ represents the
parameters of the convolutional autoencoder layer that need to be learned, where $b \in \mathbb{R}^m$
and $W = \{w_j, j = 1, 2, \dots, m\}$ represent the parameters of the convolutional encoder, and
$w_j \in \mathbb{R}^{n \times l \times l}$ is defined as a vector $w_j \in \mathbb{R}^{nl^2}$.
$\hat{W} = \{\hat{w}_j, j = 1, 2, \dots, m\}$ and $\hat{b}$ represent the parameters of the
convolutional decoder, where $\hat{w}_j \in \mathbb{R}^{nl^2}$.
   First the input image is encoded: each time, a $d \times d$ pixel patch $x_i$, $i = 1, 2, \dots, p$,
is selected from the input image, and then the weight $w_j$ of the convolutional kernel j is used
for the convolutional calculation. Finally the neuron value $o_{ij}$, $j = 1, 2, \dots, m$, is
calculated at the output layer.

$$o_{ij} = f(x_i) = \sigma(w_j x_i + b)$$

where $\sigma$ is a nonlinear activation function, commonly one of three functions, i.e., the
sigmoid function, the hyperbolic tangent function, and the rectified linear function (ReLU). We
use ReLU in this paper.
    Then the output $o_{ij}$ of the convolutional encoder is decoded: $x_i$ is reconstructed from
$o_{ij}$ to generate $\hat{x}_i$.

$$\hat{x}_i = f'(o_{ij}) = \phi(\hat{w}_j o_{ij} + \hat{b})$$

$\hat{x}_i$ is generated after each convolutional encode and decode. P patches are obtained from
the reconstruction operation, each of dimension
d × d. We use the mean square error between the original patch of the input image $x_i$
($i = 1, 2, \dots, p$) and the reconstructed patch of the image $\hat{x}_i$ ($i = 1, 2, \dots, p$)
as the cost function. Furthermore, the cost function and reconstruction error are described as:

$$J_{CAE}(\theta) = \frac{1}{p} \sum_{i=1}^{p} L_{CAE}[x_i, \hat{x}_i]$$

$$L_{CAE}[x_i, \hat{x}_i] = \|x_i - \hat{x}_i\|^2 = \|x_i - \phi(\sigma(x_i))\|^2$$
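To make the cost function concrete, the following is a minimal Python sketch of the patch-wise encode, decode and cost computation described in Section 3.5. It is only an illustration under stated assumptions: patches are flattened into plain lists, the decoder activation ϕ is taken to be the identity, and the helper names `relu`, `affine` and `cae_cost` are hypothetical and do not come from the paper.

```python
def relu(v):
    """Element-wise rectified linear activation (sigma in the text)."""
    return [max(a, 0.0) for a in v]


def affine(weights, bias, v):
    """Return weights @ v + bias, with weights given as a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) + b
            for row, b in zip(weights, bias)]


def cae_cost(patches, W, b, W_hat, b_hat):
    """J_CAE: mean over the p patches of ||x_i - x_hat_i||^2."""
    total = 0.0
    for x in patches:
        o = relu(affine(W, b, x))            # encoder: o = sigma(W x + b)
        x_hat = affine(W_hat, b_hat, o)      # decoder with phi = identity (assumption)
        total += sum((xi - ri) ** 2 for xi, ri in zip(x, x_hat))
    return total / len(patches)


# Toy data: two flattened 2x2 patches with non-negative pixel values.
patches = [[0.1, 0.4, 0.0, 0.9], [0.5, 0.5, 0.2, 0.3]]
identity = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
zeros = [0.0] * 4

# Identity encoder and decoder reconstruct every patch exactly.
perfect = cae_cost(patches, identity, zeros, identity, zeros)

# A decoder that halves every value leaves a residual of 0.25 * ||x_i||^2 per patch.
half = [[0.5 * v for v in row] for row in identity]
lossy = cae_cost(patches, identity, zeros, half, zeros)
```

In this toy setting `perfect` is zero and `lossy` is strictly positive, which is the quantity that the SGD training step drives down.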
   Through stochastic gradient descent (SGD), the reconstruction error is minimized with respect to
the weights, and the convolutional autoencoder layer is optimized. Finally, the trained parameters
are used to output the feature maps, which are transmitted to the next layer.

4   METHODOLOGY
For our model, we will be using WGAN with gradient penalty (WGAN-GP), a version of WGAN that
replaces weight-clipping with a gradient penalty on the critic, constraining the gradient norm of
the critic's output with respect to its input. This allows for more stable GAN training. The
optimal WGAN or WGAN-GP critic will contain straight lines with gradient norm 1 connecting coupled
points between $P_{data}$ and $P_{G(z)}$; since enforcing the unit gradient norm constraint
everywhere is intractable, it is only enforced along these straight lines. The new loss function,
which the critic minimizes, is as follows:

$$L = \mathbb{E}_{z \sim P_z(z)}[D(G(z))] - \mathbb{E}_{x \sim P_{data}(x)}[D(x)] + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}(\hat{x})}[(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2]$$

where $\lambda$ is the weight given to the gradient penalty, and $\hat{x} \sim P_{\hat{x}}(\hat{x})$
are random samples that are uniformly distributed along straight lines between pairs of points
sampled from the real data distribution $P_{data}$ and the generated data distribution $P_{G(z)}$.
We hypothesize that generated data can improve lung nodule detection sensitivity, allowing for
better training of CAD systems with existing data. We can use the generator to produce new training
data to augment the existing training data.

   Since the workload for labeling ROIs is high and the pulmonary nodules are difficult to
recognize, the CT images are divided into small patch areas for training the network. Each patch
taken from the CT image is input to the Convolutional Autoencoder (CAE) for the purpose of learning
the feature representation, which is used for classification. The parameters of the convolution
layers in the CNN are determined by autoencoder unsupervised learning, and some data is used for
fine-tuning the parameters of the CAE and training the classifier.
   A patch taken from the original CT image can be represented as $x \in X$,
$X \subset \mathbb{R}^{m \times d \times d}$, where m represents the number of input channels and
$d \times d$ represents the input image size. The labeled data are represented as $y \in Y$,
$Y \subset \mathbb{R}^n$, where n represents the number of output classes. Through the proposed
model, it is expected to deduce the hypothesis function from the training, i.e., $f: X \to Y$, and
the set of parameters $\theta$.

   In the proposed model, the hypothesis function f based on the deep learning architecture consists
of multiple layers, and is therefore not a direct mapping from X to Y. Specifically, the first layer
$L_1$ receives the input image x, and the middle layers consist of three convolution layers and
three pooling layers.

Figure 2: Block Diagram of Proposed Model

 Algorithm 1: Unsupervised Training of CAE
  1   Given dataset U and the number N of convolution and pooling layers, randomly initialize
      all weight matrices and bias vectors
  2   i ← 1
  3   if i == 1 then
  4        The input of C_i is U
  5   else
  6        The input of C_i is the output of P_{i-1}
  7   Greedy layer-wise training of C_i
  8   Find the parameters of C_i by the cost function
  9   The output of C_i is the input to P_i
 10   Apply the max pooling operator P_i
 11   if i < N then
 12        i ← i + 1; goto line 3

   The convolutional autoencoder has the following architecture:
     • Input: 40 × 40 patch image from the CT image.
     • C1: Convolution kernel of size 5 × 5, number of kernels is 50, non-linear function is ReLU.
     • P1: Max pooling is used, the size of the pooling area is 2 × 2 with stride 2.
     • C2: Convolution kernel of size 3 × 3, number of kernels is 50, non-linear function is ReLU.
     • P2: Max pooling is used, the size of the pooling area is 2 × 2 with stride 2.
     • C3: Convolution kernel of size 3 × 3, number of kernels is 50, non-linear function is ReLU.
     • P3: Max pooling is used, the size of the pooling area is 2 × 2 with stride 2.
The convolutional autoencoder is trained in an unsupervised manner, as explained in Algorithm 1,
and the parameters are optimized through SGD. A mini-batch size of 100 samples and 150 iterations
for each batch are used.
The output from the last pooling layer is fed as input to the CALM
classifier, which is explained in Section 5.

5  CLUSTERING AUGMENTED LEARNING METHOD (CALM)

5.1 Proposed Approach
Input augmentation We consider a matrix of input data D and a set of
cluster centers C. Since in this case study a nodule is either malignant
or not, we set |C| = 2. In this paper, we use clustering to augment the
input data x ∈ D for better learning. To augment the input data, we add a
new set of features indicating whether an input example belongs to a
cluster. To distinguish input examples, we introduce an additional index
h ∈ {1, ..., |D|} denoting the number of an input example (x_1 is the
first input example of D). We also define a vector c_h composed of
c_{hl}, l ∈ C, for each example x_h ∈ D. It is a one-hot representation
containing zeros except at the index of the cluster the example belongs
to (e.g., c_1 = [0, 1] means that the first input example x_1 belongs to
the 2nd cluster out of 2 clusters). Finally, we augment the input
examples by concatenating the vector x_h with the vector c_h for each
h ∈ {1, ..., |D|}.

As a result, the clustering process aggregates data with similar
characteristics, resulting in better learning by the FNN model. We
include the following additional constraints, where l denotes the
cluster to which example x_h belongs:

    c_{hl} = 1                                                       (1)
    c_{ha} = 0, \quad \forall a \in \{1, \dots, |C|\},\ a \neq l      (2)

5.2 Clustering Problem
We have a distance/dissimilarity measure d_{il} between input examples
i ∈ D and cluster centers l ∈ C. The clustering problem aims to assign
each input example to a cluster such that the total distance between the
elements of a cluster and its center is minimized. We introduce a new set
of binary variables c_{il} equal to 1 if input example i ∈ D belongs to
the cluster whose center is l ∈ C, and 0 otherwise. The clustering
problem is formulated as follows:

    \min \sum_{i \in D} \sum_{l \in C} d_{il} c_{il}                 (3)
    \text{s.t.} \sum_{l \in C} c_{il} = 1, \ \forall i \in D, \quad c_{il} \in \{0, 1\}, \ \forall i \in D, \forall l \in C      (4)

The objective function (3) minimizes the total distance between a cluster
center and its elements. Constraints (4) ensure that each element is
assigned to exactly one cluster and that the decision variables are
binary.
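For a fixed set of centers, problem (3)-(4) decouples across examples, so an optimal solution simply assigns each example to its nearest center; the resulting one-hot rows are the vectors c_h that Section 5.1 concatenates to x_h. The following is a minimal sketch of that step, assuming the dissimilarities d_il are already available as a |D| x |C| NumPy array. The function names `assign_clusters` and `augment_inputs` and the toy data are illustrative and do not come from the paper.

```python
import numpy as np


def assign_clusters(dist):
    """Solve (3)-(4) for a fixed dissimilarity matrix.

    dist[i, l] holds d_il. Constraint (4) couples nothing across examples,
    so the optimum assigns each example to its nearest center.
    Returns the cluster labels and the one-hot matrix whose rows are c_h.
    """
    dist = np.asarray(dist, dtype=float)
    labels = dist.argmin(axis=1)  # ties go to the lowest cluster index
    one_hot = np.zeros_like(dist)
    one_hot[np.arange(dist.shape[0]), labels] = 1.0
    return labels, one_hot


def augment_inputs(x, one_hot):
    """Concatenate each feature vector x_h with its one-hot vector c_h."""
    return np.concatenate([np.asarray(x, dtype=float), one_hot], axis=1)


# Toy data (made up): 2 examples, 3 features, |C| = 2 centers.
x = np.array([[0.1, 0.2, 0.3], [0.9, 0.8, 0.7]])
dist = np.array([[0.5, 0.2], [0.1, 0.4]])
labels, c = assign_clusters(dist)
x_aug = augment_inputs(x, c)  # shape (2, 5): 3 features + 2 cluster flags
```

In this toy case the first example is closest to the second center, so its row of `c` is [0, 1], matching the c_1 = [0, 1] example in Section 5.1.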
In this paper, we also propose a novel dissimilarity measure based on the
weights of the trained FNN model. It uses the average of the weights
linked to each neuron of the input layer. Assuming that the original
input (without the new clustering feature) has d dimensions
(x_h = [x_{h1}, ..., x_{hd}], h ∈ {1, ..., |D|}) and that the weight
linking node n of the input layer to node j ∈ {1, ..., n_1} of the
following layer is w_{nj}, the distance measure is formulated as follows:

    d_{il} = \sum_{n \in \{1, \dots, d\}} \operatorname{avg}_{j \in \{1, \dots, n_1\}} (w_{nj}) \, |x_{in} - x_{ln}|

Thus the distance measure computes the distance between two examples
based on how much each input feature contributes to the resulting
prediction. Therefore, the resulting clusters contain examples with
similar potential to improve the classification results.
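The weighted dissimilarity above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: `w_in` is assumed to be a (d, n_1) array whose row n holds the weights leaving input node n, restricted to the d original features. For a PyTorch `nn.Linear` layer, which stores its weight as (out_features, in_features), that would be the transpose of the layer weight. The function name and the toy numbers are hypothetical.

```python
import numpy as np


def weighted_dissimilarity(x, centers, w_in):
    """Return d with d[i, l] = sum_n avg_j(w_in[n, j]) * |x[i, n] - centers[l, n]|.

    x:       (|D|, d) original input features
    centers: (|C|, d) feature vectors of the current cluster centers
    w_in:    (d, n_1) first-layer weights, row n = weights leaving input node n
    """
    x = np.asarray(x, dtype=float)
    centers = np.asarray(centers, dtype=float)
    w_in = np.asarray(w_in, dtype=float)
    feature_weight = w_in.mean(axis=1)  # average over j, shape (d,)
    # Broadcast to all (example, center) pairs: shape (|D|, |C|, d)
    abs_diff = np.abs(x[:, None, :] - centers[None, :, :])
    return abs_diff @ feature_weight  # weighted sum over n, shape (|D|, |C|)


# Toy data (made up): 2 examples, 2 centers, d = 2 features, n_1 = 2 hidden nodes.
x = [[1.0, 2.0], [3.0, 5.0]]
centers = [[1.0, 2.0], [2.0, 4.0]]
w_in = [[0.5, 1.5], [2.0, 4.0]]  # per-feature averages: 1.0 and 3.0
d = weighted_dissimilarity(x, centers, w_in)
```

The sketch follows the formula literally and uses the plain average of the weights, so a feature whose outgoing weights average to a negative value contributes negatively to the sum.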

Figure 3: Architecture of Clustering Augmented Learning Method (CALM)
Classifier

Cluster centers To determine the cluster centers, CALM consists of a
clustering model and a Feed-Forward Neural Net (FNN) with a softmax
output to classify the lung nodules. For the clustering model, we propose
to use a Random Forest classifier to determine the cluster centers. After
the FNN is trained using a state-of-the-art solver on the data belonging
to a single cluster l ∈ {1, ..., |C|}, a Random Forest classifier is used
to find the best cluster center. Hence we repeat |C| instances of
training the FNN to find the |C| centers. For any instance l of the
model, we use the one-hot encoded vector of l as the label for all input
samples in that cluster to train the Random Forest classifier in a
supervised manner. In simple words, while predicting the center of the
2nd cluster (for example), we use [0, 1] as the label for all input
samples in that cluster, since |C| is 2. We propose that the input sample
with the lowest error in predicting its cluster label is taken as the
center of that cluster in the subsequent iteration of the proposed
approach. In this manner, the center is the input sample that is the most
fitting representative of that cluster.

5.3 Proposed Algorithm
As in Fig. We propose an approach (Algorithm 2) where we iteratively
train the FNN classifier, use its weights for input data clustering, thus
changing the input vector, train the FNN classifier again using the new
input data, and so on until a stopping criterion is attained. The
stopping criterion is triggered if the cluster assignment remains the
same for 10 consecutive iterations, i.e., the clustering problem
converges.
The configuration of the proposed model is given as:

   A) Classification Model: FC1 → Leaky ReLU → FC2 → Leaky ReLU → FC3 →
      Softmax. Dimension of FC1: 128. Dimension of FC2: 32. Dimension of
      FC3: 2.
   B) Optimizer: Adam, learning rate 0.001, momentum rate 0.9, weight
      decay (L2 regularization) 1e-4.

6  DATASET
The Lung Image Database Consortium (LIDC) has made a database publicly
available that contains thoracic CT images of 1010 lung cancer patients,
and each scan has been annotated by up to 4


radiologists on semantic characteristics and malignancy. The ratings were
obtained by performing the biopsy, surgical resection, progression or
reviewing the radiological images to show 2 years of nodule state at two
levels; first at the patient level and second diagnosis at the second
level. The LIDC database of thoracic CT studies for 1010 patients was
acquired over a long period with various scanners.

 Algorithm 2: Clustering-augmented learning method
   Step 0: Data obtained after extracting information using the
   Convolutional Autoencoder (CAE) acts as input to CALM.
   Step 1: Initialize the cluster centers u_1, ..., u_{|C|} randomly.
   Cluster the output data obtained from the Convolutional Autoencoder
   (CAE) and augment each data sample with its one-hot encoded cluster
   label.
   Step 2: Training the FNN classifier & clustering model
   foreach l ∈ {1 ... |C|} do
       Train the FNN model on data belonging to cluster l to learn
       classification.
       For supervised training of the random forest classifier, use the
       one-hot encoded representation of clusters as labels.
       Running the clustering model gives the cluster center u_l.
   Step 3: Clustering
   Update the dissimilarity matrix using W*
   if stopping criterion is attained then Stop.
   else go to Step 2.

   We excluded nodules with outliers in x, y or z dimensions. Outliers
are defined as values more than 1.5 times the interquartile range above
the third quartile. We also excluded scans with slice thickness greater
than 2.5 mm. This left 666 CT scans for training and 86 CT scans for
evaluation. To reduce noise in our training data, we also excluded
nodules annotated by fewer than 3 radiologists.
   The LIDC dataset also provides information and coordinates on each
nodule. We chose an input size of 40 × 40 since that is large enough to
fully contain the largest nodules. Classic data augmentation was
performed on the positive examples: translations of up to 10 pixels in
the XY plane are added to the positive training set. Negative data is
defined as inputs that did not contain nodules agreed on by any
radiologists. The final input data has 5422 image labels of size
40px × 40px. For comparison, the size of a whole CT scan is
512px × 512px × N slices, where N is the number of slices, ranging from
65 to 764 for different CT scans. The training and evaluation sets are
randomly partitioned in the proportion 8:2. Precisely, there are
[0.8 × 5422] = 4338 initial positive training examples, and since we want
our initial training data to be balanced, we also take 4338 initial
negative training examples out of a practically infinite number
available. In total, the initial data consists of 8676 (4338 positive +
4338 negative) training examples and 2168 (1084 positive + 1084 negative)
validation examples.
   In this paper, to improve lung nodule detection in existing CADe
systems, we augment training datasets with generated images obtained
using a Generative Adversarial Network (GAN). We used an augmentation
rate of 50% while using GANs. Since the original number of training
samples is 4338 positive + 4338 negative = 8676, the number of augmented
samples added is [4338 × 0.5] = 2169 positive and [4338 × 0.5] = 2169
negative. Since negative training volumes are easy to obtain, the WGAN-GP
is trained on all of the positive training examples so that it will
generate positive data.

                     Table 1: Performance of models

   Model                            Accuracy  Precision  Recall  F1     AUC
   GAN+CAE+CALM (Proposed Model)    95.3%     94.9%      95%     95%    0.97
   GAN+CAE+NN                       94.2%     94.6%      93.5%   93.5%  0.93
   GAN+CAE+LR                       90.3%     92%        92%     92%    0.91
   GAN+CAE+SVM                      90.1%     92%        92%     92%    0.90
   AE[9]                            77%       76%        77%     77%    0.83
   CNN                              89%       88%        90%     89%    0.95

Where GAN represents Generative Adversarial Network, CAE represents
Convolutional Autoencoder, CALM represents Clustering Augmented Learning
Method, NN represents Neural Network, LR represents Logistic Regression,
SVM represents linear kernel Support Vector Machine.

Figure 4: Plot of Intra-Cluster Variance vs Iterations

7  RESULT
The convolutional neural network for learning lung nodule image features
is similar to common image feature learning. Both CNN and conventional
learning use the labeled dataset, and learn the network parameters
between each layer from the input layer to the output layer by use of
forward and backward propagation. We compare the classification
performance of the proposed model, the autoencoder (AE)[9], and a
convolutional neural network (CNN) on the same dataset. Results are shown
in Table 1 and the Receiver Operating Characteristic (ROC) curve is shown
in Fig. 6. To justify


the contribution of the CALM classifier, we also compare the results
obtained by using traditional classifiers, such as logistic regression
and a linear kernel support vector machine, on the features obtained from
the last pooling layer of the convolutional autoencoder. Moreover, Fig. 4
shows how the intra-cluster variance decreases after approximately 75
iterations and then stabilizes. To measure intra-cluster variance, we
used Euclidean distance in this case study. Similarly, it is evident from
Fig. 5 that the testing loss starts decreasing after 80 epochs and that,
gradually, as the clustering solution converges, the accuracy begins to
improve. This observation bolsters our initial assumption that clustering
data based on inherent characteristics would improve the learning process
of the FNN.

Figure 5: Training and Testing Loss

   The accuracy, precision, recall, F1, and AUC of the proposed method
are 95.3%, 94.9%, 95%, 95% and 0.97 respectively. For the AE
(Autoencoder) method, we train the neural net in an unsupervised manner
and test on the same dataset for classification. We use 1024 neurons in
the fully connected layer in the AE method. We have also compared the
proposed model with the accuracy reported in previous literature. The
comparison is shown in Table 2.

Figure 6: ROC on classification

                  Table 2: Comparison with Literature

   Model                            Accuracy
   Proposed model                   95.3%
   Kuruvilla and Gunavathi [10]     93.3%
   Nascimento et al. [12]           92.78%
   Krewer et al. [8]                90.91%
   da Silva [16]                    82.3%
   Kumar et al. [9]                 77%

8  CONCLUSION
In this paper, we present a novel approach to assist in CT image
analysis. Approaches based on segmentation and handcrafted features are
time-consuming and labor-intensive, while the data-driven approach avoids
the loss of important information in nodule segmentation. Methods based
on Convolutional Neural Networks (CNN) suffer from the scarcity of
labeled data in the medical domain. To overcome that issue, we propose
the use of Generative Adversarial Networks to augment training data. We
leverage a Convolutional Autoencoder architecture for feature learning,
in which the network is initially trained in an unsupervised manner with
a large amount of data and the classifier is later fine-tuned using a
supervised approach. As the results and the comparison table show, our
proposed system achieves higher accuracy (95.3%) than the approaches
listed in Table 2. In the future, we will work on amalgamating domain
knowledge and data-driven feature learning.

REFERENCES
 [1] A. A. Abdullah and S. M. Shaharum. 2012. Lung cancer cell
     classification method using artificial neural network. Information
     Engineering Letters 2, 1 (2012), 49–59.
 [2] C Aliferis, I Tsamardinos, P Massion, P Fanananpazir, D Hardin, A
     Statnikov, N Fananapazir, and D Hardin. 2003. Machine learning models
     for classification of lung cancer and selection of genomic markers
     using array gene expression data. In FLAIRS.
 [3] Min Chen, Xiaobo Shi, Yin Zhang, Di Wu, and Guizani Mohsen. 2017.
     Deep Feature Learning for Medical Image Analysis with Convolutional
     Autoencoder Neural Network. IEEE transactions on Big Data (2017).
 [4] M. J. M. Chuquicusma, S. Hussein, J. Burt, and U. Bagci. 2018. How to
     fool radiologists with generative adversarial networks? a visual
     turing test for lung cancer diagnosis. In 2018 IEEE 15th
     International Symposium on Biomedical Imaging (ISBI 2018). IEEE,
     240–244.
 [5] Carsten Eickhoff, Yubin Kim, and Ryen White. 2020. Overview of the
     Health Search and Data Mining (HSDM 2020) Workshop. In Proceedings of
     the Thirteenth ACM International Conference on Web Search and Data
     Mining (WSDM ’20). ACM, New York, NY, USA.
     https://doi.org/10.1145/3336191.3371879
 [6] R. Fakoor, F. Ladhak, A. Nazi, and M. Huber. 2013. Using deep learning
     to enhance cancer diagnosis and classification. In ICML Workshop on
     the Role of Machine Learning in Transforming Healthcare (WHEALTH).
     ICML, 4493–4498.
 [7] M. Frid-Adar, I. Diamant, E. Klang, M. Amitai, J. Goldberger, and H.
     Greenspan. 2018. Gan-based synthetic medical image augmentation for
     increased cnn performance in liver lesion classification.
     arXiv:1803.01229 (2018).
 [8] H. Krewer, B. Geiger, and L. O. Hall. 2013. Effect of texture features
     in computer aided diagnosis of pulmonary nodules in low-dose computed
     tomography. In IEEE International Conference on Systems, Man, and
     Cybernetics (SMC). IEEE, 3887–3891.
 [9] D Kumar, A Wong, and D.A Clausi. 2015. Lung Nodule Classification
     using deep features in CT images. In 12th IEEE Conference on Computer
     and Robot Vision (CRV). IEEE, 133–138.
[10] J Kuruvilla and K Gunavathi. 2014. Lung cancer classification using
     neural networks for ct images. Computer methods and programs in
     biomedicine 113, 1 (2014), 202–209.


[11] L. Nanni, S. Brahnam, and A. Lumini. 2012. Combining multiple
     approaches for gene microarray classification. Bioinformatics (2012),
     1151–1157.
[12] L. B. Nascimento, A. C. de Paiva, and A. C. Silva. 2012. Lung nodules
     classification in CT images using Shannon and Simpson diversity
     indices and SVM. In Machine Learning and Data Mining in Pattern
     Recognition. 454–466.
[13] S Ramaswamy, P Tamayo, R Rifkin, S Mukherjee, C Yeang, M Angelo, C
     Ladd, M Reich, E Latulippe, J.P Mesirov, T Poggio, W Gerald, M Loda,
     E.S Lander, and T.R Golub. 2001. Multiclass cancer diagnosis using
     tumor gene expression signatures. In National Academy of Sciences of
     the United States of America.
[14] A. Riccardi, T. S. Petkov, G. Ferri, M. Masotti, and R. Campanini.
     2011. Computer-aided detection of lung nodules via 3d fast radial
     transform, scale space representation, and zernike mip
     classification. Medical physics 38, 4 (2011), 1962–1971.
[15] A. Sharma, S. Imoto, and S Miyano. 2012. A top-r feature selection
     algorithm for microarray gene expression data. IEEE/ACM Trans.
     Comput. Biol. Bioinformatics 9 (2012), 754–764.
[16] G. L. F. da Silva, A. C. Silva, A. C. de Paiva, and M. Gattass.
     [n.d.]. Lung nodules classification in CT images using Shannon and
     Simpson diversity indices and SVM. In Machine Learning and Data
     Mining in Pattern Recognition.
[17] H. I. Suk, S. W. Lee, D. Shen, and A. D. N. Initiative. 2014.
     Hierarchical feature representation and multimodal fusion with deep
     learning for ad/mci diagnosis. NeuroImage 101 (2014), 569–582.
[18] Y Wang, I.V Tetko, M.A Hall, E Frank, A Facius, K.F. X Mayer, and H.W
     Mewes. 2005. Gene selection from microarray data for cancer
     classification-a machine learning approach. Comput. Biol. Chem. 29, 1
     (2005), 37–46.
[19] G. Wu, M. Kim, Q. Wang, Y. Gao, S. Liao, and D. Shen. 2013.
     Unsupervised deep feature learning for deformable registration of mr
     brain images. In Medical Image Computing and Computer-Assisted
     Intervention–MICCAI 2013. SPRINGER, 649–656.
[20] X. Zhu, Y. Liu, J. Li, T. Wan, and Z. Qin. 2018. Emotion
     classification with data augmentation using generative adversarial
     networks. In Pacific-Asia Conference on Knowledge Discovery and Data
     Mining. SPRINGER, 349–360.
[21] D. Zinovev, J. Feigenbaum, J. Furst, and D Raicu. 2011. Probabilistic
     lung nodule classification with belief decision trees. In Engineering
     in Medicine and Biology Society, EMBC. IEEE, 4493–4498.