Cytological Image Classification Using Data Reduction
              Oleh Berezsky1[0000-0001-9931-4154], Oleh Pitsun2[0000-0003-0280-8786],
            Taras Dolynyuk3[0000-0003-1758-1203], Lesia Dubchak4[0000-0001-5763-9540],
            Nadiya Savka5[0000-0003-4182-7867], Grygoriy Melnyk5[0000-0003-0646-7448],
                              Vasyl Tesluk7[0000-0002-5974-9310]
                 1,2,3,4,5,6
                            Ternopil National Economic University, 46001, Ukraine
                       7
                           Lviv Polytechnic National University, 79000, Ukraine

                                        ob@tneu.edu.ua
                                     o.pitsun@tneu.edu.ua
                                       trsdln@gmail.com
                                       dlo@tneu.edu.ua
                                     n.savka@tneu.edu.ua
                                        mgm@tneu.edu.ua
                                    vasyl.teslyuk@gmail.com


        Abstract. In this paper, the authors investigate data reduction techniques using
        cytological image data for the purpose of further classification applying modern
        approaches. Cytological images are widely used in diagnosing cancerous and
        precancerous conditions of the breast. Classification of the whole image is a ra-
        ther time consuming process, so the authors apply an approach when quantita-
        tive characteristics of micro-objects (cell nuclei) are used for classification. The
        authors carried out a comparative analysis of the classification of the cytologi-
        cal images based on the quantitative characteristics of their nuclei using modern
        classifiers. The main criteria for describing micro-objects (cell nuclei) are the
        following ones: area, perimeter, circumference, maximum width and length, ar-
        ea and perimeter of the bounding box. The structure of the biomedical image
        classification system is developed, including image processing stages, calcula-
        tions of quantitative characteristics of micro-objects, data reduction and classi-
        fication. The principal component method is used as data reduction technique.
        To classify data the following methods are used: a single-layer perceptron, lo-
        gistic regression, support vectors machine, and the k-nearest neighbor method.
        Testing was performed using BPCI2100 database of cytological images of can-
        cerous and precancerous conditions of the breast.

        Keywords: Classification, Principal Component Method , Cytology, breast
        precancerous conditions.


1       Introduction

Cytological and histological images are used to diagnose precancerous and cancerous
conditions of the breast. After a microscopic examination, a specialist can determine
the type of an image. To simplify the analysis process of cytological and histological
images, a number of automated microscopy systems (AMSs) with functions for image

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution
4.0 International (CC BY 4.0)
2019 IDDM Workshops.
2


processing were developed. The use of artificial intelligence, in particular artificial
neural networks, support vector machines, etc. show the current trends of improve-
ment in AMSs [1]. This makes the process less time-consuming and allows us to
increase the diagnostic efficiency. The main indicators of the pathology in cytological
specimens are the shape and structure of the cells. The following quantitative charac-
teristics of the investigated micro-objects such as area, perimeter, circumference,
angle of inclination of the main axis, etc. are used. To calculate them, firstly, the input
image is preprocessed (filtering, histogram alignment). In next stage, the following
segmentation methods are used: threshold segmentation, watershed method, k-means
method or their combinations. When the cytological image is converted into a binary
format, each micro-object is detected and the quantitative characteristics are calculat-
ed. The training sample has a set of features. Array features include redundant and
uninformative features. Therefore, more time is required for classification. So, it is
necessary to reduce the input feature set. The following methods are used for feature
reduction: complete search, depth-first search, breadth-first search, branch-and-bound,
group method of data handling, feature ranking, feature clustering, evolutionary
search, etc. [2]. In this paper, the principal component method is used for reduction of
input characteristics. The following methods have been selected as high-level com-
puter vision tools: support vector method, logistic regression, a single-layer percep-
tron, and the k-nearest neighbor method. Support vector method is a method of ana-
lyzing data for classification using directed learning models. Each element of the
training samples is assigned to a certain class. The training algorithm creates a model
that assigns new samples to one of the classes. Formally, the support vector machine
builds a hyperplane, or a set of hyperplanes in high-dimensional space that can be
used for classification, regression and other tasks [3]. Logistic regression is a statisti-
cal regression method used when a dependent variable is categorical, that is, it can
have only two values [4]. The idea of logistic regression is that the space of the origi-
nal values can be divided by a line into two corresponding classes. The k-nearest
neighbor method assigns objects to the class that most of its k-nearest neighbors be-
long to in a multidimensional feature space. This is one of the simplest algorithms for
learning classification models [5]. A single-layer perceptron is the simplest kind of
artificial neural networks, which is based on a mathematical model of the information
perception by the human brain, and consists of sensors, associative and responsive
components.
The main advantage of using quantitative characteristics for the purpose of micro-
object classification is the lack of a subjective human factor. Therefore, an urgent
problem is evaluation of the quantitative characteristics of micro-objects, their reduc-
tion and data classification using modern classifiers to improve diagnostic accuracy.


2      Literature review

Classification is an important part of the data analysis process, which can be per-
formed by different algorithms divided into different groups. These groups are based
                                                                                       3


on machine learning techniques [6]. Modern approaches to detection and classifica-
tion of cell nuclei in cytological images are considered in [7]. In this paper, the au-
thors compare classification results obtained using manual markups and deep learning
methods. In [8], the authors provide a comparative analysis of the results of cytologi-
cal image classification using the k-nearest neighbor method and the support vector
method. In the experiments, the shape of the nuclei and the structure of the tissue
were taken into account. The study of the k-nearest neighbor method is relevant in the
field of data mining and machine learning [9]. Zhao in [10] developed a new algo-
rithm based on the use of labeled samples. These methods were used mainly for fast
searching [11], reducing the dimension, and improving the efficiency of algorithms.
The support vector method is a set of learning methods used for classification and
regression. They belong to the family of generalized linear classification [12].
In [13], a hybrid method of combining the support vector method and accelerator
methods was developed. The authors show the effectiveness of the classification, and
the SVM is used as the basic classifier for the data group classification. In [14-15],
structures of convolutional neural networks were proposed for the classification of
breast cancer histopathology images regardless of their degree of enlargement. The
advantage of the developed systems on the basis of the proposed structures is the
automation of the diagnostic process and the formation of a database for further re-
search. Comparison of the quality of the breast cancer detection using magnetic reso-
nance imaging and immunohistochemical studies is presented in [16]. A comparative
analysis of approaches to biomedical image analysis is presented in [17,18].
   The analysis of the above-mentioned publications has shown that scientists pay
considerable attention to the problem of finding the ways of diagnosing precancerous
and cancerous conditions of the breast on the basis of artificial intelligence systems.
However, the complexity of the study and the large number of classifiers require addi-
tional research and comparative analysis, and a considerable amount of data needs to
be reduced in size.


3      Problem statement

The purpose of this work is to analyze the existing means of artificial intelligence and
their applications for the classification of cytological images based on the quantitative
characteristics of micro-objects. To achieve this goal, the following tasks must be
accomplished:
   1. To      reduce      data     using      the    principal   component       method.
   2. To develop the structure of classification system of cytological images.
   3. To conduct computer experiments in order to carry out a comparative analysis
   of the classifiers.
   Formally, the formulation of the problem is as follows. Assume a set of features
(1):


                           X   xij  , i  1, m , j  1, n ,                       (1)
4


where n – a number of features, m – a number of their implementations.
In addition, a set of classes P is given.
To carry out the classification procedure we use a set of classifiers (2):

                                    C  C1 , C 2 , ..., Ct  ,                                 (2)

where t – a number of classifiers.


    After applying the PCA, we obtain a set of components Y   y i  , where Vі  1, m .
Each component is a linear combination of features (3):


                          y j  w1 j x1  w2 j x2  ...  wrj xr ,                              (3)

when r  m .

Thus, we get a set of components with their contributions  y , I  , k  1, s , where  k   k


s – a number of principal components.
The classification accuracy  0 is specified. Then, it is necessary to find s value in
such a way:

                                    s  Y  min Y . (4)
                                              s

                                                         s
                                                        Y Y ,    0


4       Principal component analysis
Input feature matrix X is given:


                         x   11
                                       x12        ...         x1 j       ...   x1 n 
                         x            x22        ...         x2 j       ...   x2 n 
                             21
                                                                                 
                                                 ...                    ...     
                      X                                                        .             (5)
                           x           xi 2       ...         xij        ...   x
                              i1
                                                                                 
                                                                                in


                                                 ...                    ...     
                                                                                
                         x   m1
                                      xm 2        ...         xmj        ...   x 
                                                                                mn
                                                                                                                  5


  In columns there are features, they are indexed by j ( j  1, n ), and the lines are

their implementations. They are indexed by i ( і  1, m ).
   The PCA implementation can be presented by a number of steps.

1. Centering and rationing of the output data is performed according to the formula
   (6):

                                                          xij  x j
                                                                            ,                                    (6)
                                                              j
where j – the number of the original variable, i – the implementation number of the
 j -th variable, and x j and  – the arithmetic mean and root mean square deviation
                                         j


of the x feature.
        j


2. Calculating the covariance (S) or correlation (R) matrix.

                                                                                      cov x , x 
                                     2
                                 x1
                                                      cov x , x
                                                                                                      
                                                                1   2                          1   n


                                                           x
                                                               2


                    S
                        cov         x 2 , x1                   2
                                                                                        cov x , x 
                                                                                               2   n


                                                                                                      
                                                                                                            ,    (7)

                                                                                                      
                        cov        x n , x1
                                                      cov x , x
                                                                n   2
                                                                                          x
                                                                                               2

                                                                                               n
                                                                                                       

                                                        x  x  x  x  ,
                                                          n
                                                 1
                    cov xk , x j                                                                                (8)
                                         n  1 i 1
                                                                    ik              k     ij           j


                                                                cov xk , x j
                                                 rxk , x j                                                      (9)
                                                                   x xk       j


  3. Finding the eigenvalues     …   p  0 of the matrix S (or R) using char-
                                             1        2


  acteristic equation (10):

                       det  S   E   0 or det  R   E   0 ,                                             (10)

where E – a unitary matrix (a square matrix with ones on the diagonal and zeros
elsewhere).

  4. Finding the eigenvector for each eigenvalue  j . The eigenvector is the solution
to the system of equations (11):
6


                      S   E   w  0 , or  R   E   w  0 ,                 (11)


        where w – eigenvector.
                                                                       yj
    5. Finding linear combinations for principal components

                          y j  w1 j x1  w2 j x2  ...  w pj x p .                (12)

   6. Analysis of the contribution of each of the principal components and their rank-
ing in ascending order.
        The contribution of each component is evaluated by the formula (13):

                                               j
                               Ij                          ,                       (13)
                                      1  2  ...   p
where j – a number of a component.


5       Structure of the cytological image classification
        module
The lack of clear contours of cell nuclei leads to difficulties in cytological image pro-
cessing. Transmission of the digital image from the camera to the microscope and via
communication channels to the computer causes pulse noises, which can often be
classified by the software system as a part of the investigated object. Examples of
cytological images of precancerous and cancerous conditions of the breast are taken
from the BPCI2100 database [19](Fig. 1).
                                                                                     7


                               Fig. 1. Cytological images

   During the study, the following quantitative characteristics of cell nuclei were de-
termined: area, perimeter, length, width, circumference, coordinate Xc, coordinate
Yc, length of the major axis, length of the minor axis, angle of inclination of the ma-
jor axis to the OX axis, perimeter of a bounding box, coordinate Bx, coordinate By,
bounding box width, bounding box length, bounding box area, aspect ratio.
   The structure of cytological image classification module is shown in Figure 2.
8


               Fig. 2. Structure of cytological image classification module

The classification process consists of the following steps:
   1. Filtering the input image. This stage makes it possible to reduce noise. Gaussian
and Median filtering is used to reduce Gaussian and pulse noise, respectively. Filter-
ing can significantly improve image quality that will have positive impact on the next
stages.
   2. Segmentation. A segmentation stage is required to select particular areas in the
image (cell, background nuclei). For cytological images, the best results were shown
by watershed algorithms, k-means, and threshold segmentation. Based on these algo-
rithms, an algorithm for segmentation of cytological and histological images was
                                                                                        9


developed [20]. Quantitative evaluation of image segmentation quality was performed
on the basis of Gromov-Frechet and Gromov-Hausdorff metrics.[21]
   Cell nuclei are selected using the image contour selection algorithm developed by
the authors[22,23].
   After converting the image into the type of “white background – black objects”, the
quantitative characteristics of the particular micro-objects are evaluated.
   3. Data reduction. The principal component method is used to reduce the data size.
   4. Classification. Data in the form of an array of numbers is fed to the classification
module. The classification module implements the support vector method, logistic
regression, a single-layer perceptron and the k-nearest neighbor method. Classifica-
tion results in an associative array that includes data on the parameters of the nuclei
and their corresponding classes.
To analyze the classification results, ROC-curves were constructed and AUC coeffi-
cients were calculated.


6      Structure of the cytological image classification
       module
The software module for testing the cytological image classification techniques is
written in Java programming language and deeplearning4j library.
   An example of the data reduction of the analysis of the cytological image cell nu-
clei is given in Figure 3.


                     Fig. 3. Data after reduction (3 main components)
10


  The results of the classification of the cytological images by the support vector
method are shown in Figure 4.


         2 components (AUC = 0.75)                      3 components (AUC = 0.6


         5 components (AUC = 0.79)                          8 components (AUC = 0.91)
Fig. 4. The results of the classification of the cytological images by the support vector method

   Classification quality score using the support vector method on the basis of the AUC coeffi-
cient is shown in Figure 5.
                                                                                       11


                                                                                       F
             ig. 5. Classification quality score using the support vector method


   The analysis of the results in Figure 5 shows that the highest classification accura-
cy can be achieved using 8 components. Classification results with the use of a single-
layer perceptron are shown in Figure 6.


           2 components (AUC=0.27)                           3 components (AUC=0.93)

        Fig. 6. Classification results with the use of a single-layer perceptron

   The AUC coefficient shows that the best classification quality is obtained with the
use of 3 components.
   The results of the classification of the cytological images based on multiclass lo-
gistic regression are shown in Figure 7.
12


Fig. 7. The results of the classification of the cytological images based on multiclass
                                   logistic regression

  Therefore, the best result is achieved with the use of 3 and 5 components and is
75%.


Conclusions

1. Using the basic algorithms of low, medium and high levels of computer vision, a
   classification structure of cytological images is developed. The developed structure
   includes the use of the principal component method to reduce the data size.
2. Applying the principal component method, the input indicators were reduced
   which showed that mainly three components are informative.
3. Computer experiments have shown that the best result of the classification of the
   quantitative characteristics of the cytological image nuclei is obtained using the 3
   major components. The AUC is about 93%.


References
     1.   Tsmots I., Teslyuk V., Teslyuk T., Ihnatyev I. Basic Components of Neuronetworks
          with Parallel Vertical Group Data Real-Time Processing. In: Shakhovska N.,
          Stepashko V. (eds) Advances in Intelligent Systems and Computing II. CSIT 2017.
          Advances in Intelligent Systems and Computing, Springer, Cham, vol. 689, 558 –
          576 (2017).
     2.   Berezsky O., Pitsun O., Batryn N., Berezska K., Dubchak L. Modern automated
          microscopy systems in oncology. Proceedings of the 1st International Workshop on
          Informatics & Data-Driven Medicine, Lviv, Ukraine, 28-30 november 2018 – p. 16
          (2018)
                                                                                     13

3.  Oliinyk A.O., Subbotin S. O., Oliinyk O.O. Intelektualnyi analiz danykh : navchalnyi
    posibnyk . – Zaporizhzhia : ZNTU. – 278 (2012)
4. Suthaharan S. Support Vector Machine. Machine Learning Models and Algorithms
    for Big Data Classificationю – pp. 207-235 (2016)
5. Agresti A. Categorical Data Analysis. Wiley-Interscience, New Jork (2002)
6. Zhang Z. Introduction to machine learning: k-nearest neighbors. Ann Transl Med.
    2016 Jun; 4(11), 218 (2016)
7. Trifonov R., Gotseva D., Angelov V. Binary classification algorithms. International
    Journal of Development Research. 7 (11), 16873-16879 (2017)
8. Hady Ahmady Phoulady, Peter R. Mouton. A New Cervical Cytology Dataset for
    Nucleus Detection and Image Classification (Cervix93) and Methods for Cervical
    Nucleus Detection. https://arxiv.org/abs/1811.09651, last accessed 2019/10/28
9. Loukas C., Kostopoulos S., Tanoglidi A., Glotsos D. Breast Cancer Characterization
    Based on Image Classification of Tissue Sections Visualized under Low
    Magnification Computational and Mathematical Methods in Medicine. Volume 2013,
    p. 7 (2013) http://dx.doi.org/10.1155/2013/829461
10. Zhu X., Huang Z., Cheng H., Shen H. Sparse hashing for fast multimedia search
    ACM Trans. Inform. Syst. 31 (2), 9:1-9:24 (2013)
11. Zhao D. Zou W., Sun G.. A fast image classification algorithm using support vector
    machine. 2nd International Conference on Computer Technology and Development .
    Egypt, (2010) DOI: 10.1109/ICCTD.2010.5645823
12. Zhang S. Efficient kNN classification algorithm for big data. Neurocomputing. 195
    143–148 (2016).
13. Dhivyapriya P., Sivakumar P. Classification of Cancer Dataset in Data Mining
    Algorithms Using R Tool. International Journal of Computer Science Trends and
    Technology (IJCST), 5 (1) (2017)
14. Sharma A., Dey S., A boosted SVM based ensemble classifier for sentiment analysis
    of online reviews, Appl. Comput. Rev. 13, 43–52 (2013)
15. Bayramoglu N, Kannala J, Heikkila J. Deep Learning for Magnification Independent
    Breast Cancer Histopathology Image Classification. 23rd International Conference on
    Pattern Recognition (ICPR). (2016) DOI: 10.1109/ICPR.2016.7900002
16. Berezsky O., Pitsun O., Verbobyy S., Datsko T., Bodnar A. Computer diagnostic tools
    based on biomedical image analysis. 14th International Conference The Experience
    of Designing and Application of CAD Systems in Microelectronics (CADSM), 21-25
    Feb. 2017, Lviv, Ukraine, 388 – 391 (2017)
17. Bae M.S. Quantitative MRI morphology of invasive breast cancer: correlation with
    immunohistochemical biomarkers and subtypes / Min Sun Bae, Mirinae Seo Mirinae
    Se// Acta Radiol. 2015 Mar;56(3):269-75. doi: 10.1177/0284185114524197
18. Berezsky, O., Pitsun, O., Verbovyy, S., Datsko, T., Bodnar, A. Computer diagnostic
    tools based on biomedical image analysis. 14th International Conference The
    Experience of Designing and Application of CAD Systems in Microelectronics,
    CADSM 2017, - pp. 388 – 391 (2017)
19. Berezsky O., Melnyk H., Verbovyi S., Pitsun O., Nykoliuk V., Datsko T. Certificate
    of copyright registration № 75359. Database of digital cytological and histological
    images of the precancerous and cancerous conditions of the breast "BPCI2100". Date
    of registration 12.14.2017.
20. Berezsky O., Batko Yu., Melnyk G., Verbovyy S. Segmentation of Cytological and
    Histological Images of Breast Cancer Cells. IEEE 8th International Conference on In-
    telligent Data Acquisition and Advanced Computing Systems: Technology and Ap-
    plications (IDAACS), Warsaw, Poland (2015) doi: 10.1109/IDAACS.2015.7340745
21. Berezsky, O., Zarichnyi M. Gromov-Fréchet distance between curves. Matematychni
    Studii, 50 (1), 88-92 (2018).
14

     22. Berezsky O., Bat’ko Yu. Algorithm of determination of image contours of biological
         nature. Proceedings of the International conference «Modern problems of radio
         engineering, telecommunications and computer science» TCSET2006. Lviv-Slavske.
         – Lviv, pp. 642-644 (2006)
     23. Berezsky O., Verbovyy S., Pitsun O. Hybrid intelligent information technology for
         biomedical image processing. IEEE 13th International Scientific and Technical
         Conference on Computer Sciences and Information Technologies, CSIT 2018 –
         Proceedings, Lviv, Ukraine (2018).