Fuzzy Clustering of Histopathological Images Using Deep Learning Embeddings

Salvatore Calderaro1, Giosué Lo Bosco1, Riccardo Rizzo2 and Filippo Vella2

1 Dipartimento di Matematica e Informatica, Università degli studi di Palermo, Via Archirafi 34, 90123 Palermo, Italy
2 Institute for High-Performance Computing and Networking (ICAR), National Research Council of Italy (CNR), 90146 Palermo, Italy

Abstract
Metric learning is a machine learning approach that aims to learn a new distance metric by increasing (reducing) the similarity of examples belonging to the same (different) classes. The output of these approaches are embeddings, in which the input data are mapped so as to improve a crisp or fuzzy classification process. Deep metric learning approaches implement metric learning by means of deep neural networks; such models have the advantage of discovering very representative nonlinear embeddings. In this work, we propose a triplet network deep metric learning approach, based on ResNet50, to find a representative embedding for the unsupervised fuzzy classification of benign and malignant histopathological images of breast cancer tissues. Experiments on the BreakHis benchmark dataset, using Fuzzy C-Means clustering, show the benefit of using the very low-dimensional embeddings found by the deep metric learning approach.

Keywords
Histopathological Image Classification, Deep Learning, Metric Learning

WILF 2021: The 13th International Workshop on Fuzzy Logic and Applications, December 20–22, 2021, Vietri sul Mare, Italy
salvatore.calderaro01@community.unipa.it (S. Calderaro); giosue.lobosco@unipa.it (G. Lo Bosco); riccardo.rizzo@cnr.it (R. Rizzo); filippo.vella@icar.cnr.it (F. Vella)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

Digital pathology has emerged with the use of Whole Slide Images (WSI), digitizations of patient tissue samples, usually at very high resolution [1]. A WSI can have a resolution of 10,000 × 10,000 pixels and presents high morphological variance and various artefacts. A diagnosis based on a WSI is obtained through a very time-consuming process by a trained pathologist. Systems based on Machine Learning (ML) algorithms can save time by helping pathologists focus on the interesting parts of histopathological images.

Today's standard computer hardware is not powerful enough to process a WSI with a standard approach, i.e. it is not possible to feed the whole image to an ML algorithm. It is also not viable to down-sample the whole image, because critical features would be lost, so WSI processing is usually carried out by dividing the image into tiles or by picking random patches. The label of the WSI is then a combination of the labels obtained by processing the extracted tiles or patches [2]. Due to the morphological variance of the WSI, this labelling process can be difficult: many patches may present only healthy tissue even if the global label is different. Moreover, patch-level processing should be carried out at mixed magnification levels to maintain an idea of the context around image details. Patch analysis also has an intuitive drawback: it cannot detect characteristics bigger than the patch itself, such as the shape and size of the tumour, which adds another difficulty to the automatic processing of WSI images.
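As a concrete illustration of the tiling and label-combination strategy just described (our sketch, not code from the paper), the following Python snippet splits a large image into fixed-size tiles and aggregates per-tile labels into a slide-level label; the tile size and the majority-vote aggregation rule are assumptions chosen for illustration only.

```python
import numpy as np

def extract_tiles(image: np.ndarray, tile_size: int = 512):
    """Split a (H, W, C) image into non-overlapping square tiles,
    discarding incomplete border tiles (tile size is an assumption)."""
    h, w = image.shape[:2]
    tiles = []
    for y in range(0, h - tile_size + 1, tile_size):
        for x in range(0, w - tile_size + 1, tile_size):
            tiles.append(image[y:y + tile_size, x:x + tile_size])
    return tiles

def slide_label(tile_labels) -> int:
    """Combine per-tile labels (0 = benign, 1 = malignant) into one
    slide-level label by majority vote, one possible combination rule."""
    return int(np.mean(tile_labels) >= 0.5)

# Toy usage with a random "slide" and random stand-in tile predictions.
wsi = np.random.randint(0, 256, size=(4096, 4096, 3), dtype=np.uint8)
tiles = extract_tiles(wsi)
predicted = np.random.randint(0, 2, size=len(tiles))
print(len(tiles), slide_label(predicted))
```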
Even considering these drawbacks, patch analysis remains one of the most used techniques for WSI processing, and the development of patch processing and classification systems is an active research field. Some datasets are considered standard and are used to compare the performances of different systems. One of these is the BreakHis dataset (BH) [3], constituted by 7909 histopathological images of 8 types of breast cancer taken from 82 patients; a more in-depth description of the BH dataset is given in Section 2.

The present work aims to study the application of embedding techniques to the fuzzy clustering of histopathological images coming from the BH dataset. Using fuzzy clustering as a methodology provides a continuous membership value rather than a crisp one, which could be more informative and useful for physicians. The method used for the embedding creation is metric learning [4, 5], which aims to learn a new embedding by reducing a chosen distance measure between examples of the same class and increasing the same distance between objects of different classes. In this embedding, similar data will be closer and dissimilar data will be farther apart, and this should improve the performance of clustering and classification algorithms. In recent years, deep neural networks [5] have been used to learn a metric that makes it possible to measure the similarity/dissimilarity between objects and to create a mapping into a new representation space in which the above conditions are respected. Some applications of deep metric learning, and of metric learning in general, are [4, 6]: image retrieval, visual tracking, person re-identification, medical problems, face verification and recognition, signature verification, zero-shot learning, etc.

The rest of the paper is organized as follows: Section 2 describes the deep metric learning method used and the BH dataset, Section 3 summarizes the experiments conducted to test the embedding effectiveness in the case of fuzzy C-means, and Section 4 contains the concluding remarks.

2. Materials and Methods

In this work, we want to explore the effectiveness of embedded medical images in a fuzzy clustering task. The embedding method used is based on a deep neural network called a triplet network, implemented in the PyTorch Metric Learning framework [7]. We scan the embedding dimensions from 2 to 256 to find the most effective dimension for fuzzy clustering. The following subsections describe in detail the BH dataset, the architecture of the triplet network and its training procedure.

2.1. BreakHis dataset

The BreakHis dataset comprises 7909 images at a resolution of 700 × 460 pixels, obtained from tissue samples of 82 breast cancer patients. These samples are organized into two classes, benign and malignant, each of them separated into four sub-classes, according to the structure in Table 1.

Table 1
The BreakHis dataset images.

  Class      Sub-class                  Images   Class total
  Benign     Adenosis (A)                  444
             Fibroadenoma (F)             1014          2480
             Tubular Adenoma (TA)          453
             Phyllodes Tumour (PT)         569
  Malignant  Ductal Carcinoma (DC)        3451
             Lobular Carcinoma (LC)        626          5429
             Mucinous Carcinoma (MC)       792
             Papillary Carcinoma (PC)      560

The images were acquired at different magnification levels (40×, 100×, 200×, 400×), but we neglect this detail in the following discussion. In this paper, we are only interested in the dichotomy benign vs malignant, and Table 1 shows that the number of images in the malignant class is more than double the number of benign images.
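The benign/malignant grouping of Table 1 can be made explicit in a few lines. The following sketch (an illustrative restatement, with the counts taken from Table 1) derives the binary class totals and the imbalance ratio that motivates the augmentation described next.

```python
# Sub-class image counts from Table 1, grouped into the binary labels
# used throughout the paper.
SUBCLASSES = {
    "adenosis":            ("benign", 444),
    "fibroadenoma":        ("benign", 1014),
    "tubular_adenoma":     ("benign", 453),
    "phyllodes_tumour":    ("benign", 569),
    "ductal_carcinoma":    ("malignant", 3451),
    "lobular_carcinoma":   ("malignant", 626),
    "mucinous_carcinoma":  ("malignant", 792),
    "papillary_carcinoma": ("malignant", 560),
}

totals = {"benign": 0, "malignant": 0}
for label, count in SUBCLASSES.values():
    totals[label] += count

print(totals)  # {'benign': 2480, 'malignant': 5429}
print(round(totals["malignant"] / totals["benign"], 2))  # imbalance ~2.19
```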
We implement an augmentation procedure (flipping and rotating images) to balance the dataset, increasing the number of images in the adenosis, tubular adenoma and phyllodes tumour classes. We excluded the fibroadenoma class from the augmentation because it is the most populated benign class. Using this strategy, we obtain 5412 benign images, so the dataset used for the experiments consisted of 5429 malignant images and 5412 benign images. Moreover, in this work, we do not apply any pre-processing technique to the images, e.g. stain normalization or whitening. We also decided not to correct the (little) noise in the data labelling: the same images of patient ID:13412 are present in two malignant sub-classes (Ductal Carcinoma and Lobular Carcinoma).

Other considerations should be made on the training and test splitting of the dataset after augmentation. As written before, we do not consider the magnification factor and cluster all the images as benign or malignant. When splitting the images between the training and test datasets, we use the same split for the embedding neural network and for the classification or clustering algorithm, so that, during the test, the network computes the embedding of a new image (a test image) and the classifier or the clustering algorithm processes embedded data never seen before. Moreover, the training/test splitting can be done in two different modalities: at the image level or at the patient level. At the image level, the training/test split neglects the patient information, meaning that some images of the same patient can be in the training dataset and some in the test dataset; the accuracy calculated using this splitting is called Image-Level Accuracy (ILA). At the patient level, the patient information becomes central, and the training/test split assigns 58 patients to the training set and 24 patients to the test set; this means that all the images of a patient are either in the training set or in the test set. In this case, the accuracy for classification is called Patient-Level Accuracy (PLA) and is calculated as reported in [8].

2.2. Metric Learning

In deep metric learning, the embedding is implemented by using one or more neural networks; the two most used methods are siamese networks [9] and triplet networks [10]. Both architectures are composed of neural networks with shared weights: siamese models use two networks, while triplet networks use three. In this work, we adopt a triplet network. Triplet networks use three inputs: an anchor image x_a, an image of the same class as the anchor x_p (the positive example), and an image of a different class x_n (the negative example). The goal is that the distance between the anchor and negative sample representations, d(r_a, r_n), is greater, by at least a margin m, than the distance between the anchor and positive sample representations, d(r_a, r_p). There are several choices for the metric loss [5, 6]: contrastive loss, pair loss, triplet loss, lifted structure loss, binomial deviance loss, etc. Another key factor in this technique is the choice of the metric used to measure the similarity/dissimilarity between samples; the common choices are cosine similarity and Euclidean distance.
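Before fixing the specific loss, the shared-weight structure of a triplet network can be made concrete with a minimal PyTorch sketch (ours, with a toy convolutional encoder rather than the paper's ResNet50-based model): a single encoder maps anchor, positive and negative images to representations, and a cosine distance is computed between the resulting pairs.

```python
import torch
import torch.nn as nn

class TripletEncoder(nn.Module):
    """One encoder applied three times: this is the weight sharing of a
    triplet network (toy encoder, for illustration only)."""
    def __init__(self, emb_dim: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, emb_dim),
        )

    def forward(self, x_a, x_p, x_n):
        # The same weights produce r_a, r_p and r_n.
        return self.net(x_a), self.net(x_p), self.net(x_n)

def cosine_distance(u, v):
    """d(u, v) = 1 - cos(u, v): small for similar representations."""
    return 1.0 - nn.functional.cosine_similarity(u, v, dim=1)

model = TripletEncoder()
x_a, x_p, x_n = (torch.randn(4, 3, 64, 64) for _ in range(3))
r_a, r_p, r_n = model(x_a, x_p, x_n)
print(cosine_distance(r_a, r_p), cosine_distance(r_a, r_n))
```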
In this work, the chosen loss is the triplet margin loss, defined as:

L_triplet = max{0, m + d(r_a, r_p) − d(r_a, r_n)}    (1)

During the training phase, a mining procedure selects, from the input minibatch, the most effective triplets to use in the network training [5, 6]; the most common strategies are easy negative mining, hard negative mining and semi-hard negative mining, the latter being the one used here. With semi-hard negative mining, the negative sample is farther from the anchor than the positive one, but not by more than the margin m. In this case, 0 < L_triplet < m, since:

d(r_a, r_p) < d(r_a, r_n) < d(r_a, r_p) + m.    (2)

Figure 1: A representation of the network; the vertical double arrow indicates the weight sharing.

2.3. Implementation Details

The network used for the embedding has two parts. The first part implements the feature extraction and transforms the input images into a feature vector of dimension 2048; this is a ResNet50 network [11] pre-trained on the ImageNet dataset. The second part is a single linear layer. Only the second part of the network is trained, using the L_triplet loss in eq. (1); the distance function used is the cosine similarity, and the optimization algorithm is RMSprop with a learning rate of 1 × 10^-5 and a weight decay factor of 1 × 10^-4. A representation of the network is in Fig. 1.

3. Experiments and Results

For the given training set, the following elaborations are performed on every batch (a code sketch follows the list):

• the deep neural network, ResNet50, processes the input data;
• the mining process identifies all the triplets found in the batch, according to the element labels; from this set, the semi-hard triplets are extracted as in eq. (2);
• the found triplets are used to calculate the loss, through the cosine similarity;
• the reducer averages the loss values higher than a threshold, empirically set to zero.
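The per-batch pipeline above maps almost one-to-one onto PyTorch Metric Learning primitives. The sketch below is our illustrative reconstruction, not the authors' code: a frozen pre-trained ResNet50 trunk with a trainable linear embedding head, a semi-hard triplet miner, and a triplet margin loss with cosine similarity and a zero-threshold reducer. The margin value and the `loader` object are assumptions (the paper does not state the margin).

```python
import torch
from torch import nn, optim
from torchvision import models
from pytorch_metric_learning import distances, losses, miners, reducers

EMB_DIM = 64  # one of the embedding sizes scanned in the paper (2 ... 256)

# Frozen ImageNet-pre-trained ResNet50 trunk; only the final layer is trained.
trunk = models.resnet50(pretrained=True)
for p in trunk.parameters():
    p.requires_grad = False
trunk.fc = nn.Linear(2048, EMB_DIM)  # trainable embedding head

distance = distances.CosineSimilarity()
reducer = reducers.ThresholdReducer(low=0.0)  # average loss values above zero
miner = miners.TripletMarginMiner(margin=0.2, distance=distance,
                                  type_of_triplets="semihard")  # margin assumed
loss_func = losses.TripletMarginLoss(margin=0.2, distance=distance,
                                     reducer=reducer)
optimizer = optim.RMSprop(trunk.fc.parameters(), lr=1e-5, weight_decay=1e-4)

def train_one_epoch(loader):
    """One pass over a DataLoader yielding (images, binary labels)."""
    for images, labels in loader:
        embeddings = trunk(images)            # ResNet50 + linear head
        triplets = miner(embeddings, labels)  # semi-hard triplets, eq. (2)
        loss = loss_func(embeddings, labels, triplets)  # eq. (1)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```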
The training process was repeated for the embedding sizes 2, 3, 4, 8, 16, 32, 64, 128 and 256. Figure 2 reports a representation of these embeddings. The embeddings of sizes 2 and 3 (Figures 2a, 2b) are shown as 2D and 3D plots: a locus of points with radial symmetry contains all the mapped samples, a symmetry probably due to the cosine similarity chosen as the metric to estimate sample closeness. The embeddings with sizes greater than 3 (Figures 2c–2i) have been mapped into a 2D space through the UMAP (Uniform Manifold Approximation and Projection for Dimension Reduction) library [12].

Figure 2: Bidimensional plots of the full-dataset embeddings, one panel per embedding size (2, 3, 4, 8, 16, 32, 64, 128, 256). Panels (a) and (b) show the size-2 and size-3 embedding coordinates directly in 2D and 3D; panels (c)–(i) show the data with embedding sizes ranging from 4 to 256, visualized in 2D by the UMAP approach. The colours differentiate the objects into the two classes, benign and malignant.

3.1. BreakHis Images Fuzzy Clustering

Differently from crisp clustering, fuzzy clustering associates a sample to the centroids with a degree of belonging. This property is useful to cope with the uncertainty of the hard attribution of values to artificially created clusters, and to obtain a smooth behaviour for samples far from the centroids and potentially assimilable to multiple clusters. In the experiments, as a fuzzy clustering solution, we have used fuzzy C-means [13]. The implementation used adopts the Euclidean distance as a clustering metric. To evaluate the goodness of the fuzzy clustering, some measures have been adopted: the Partition Coefficient (PC) [14], the Partition Entropy (PE) [15], the Modified Partition Coefficient (MPC) and the Modified Partition Entropy (MPE) [16]. The Partition Coefficient (PC) measures the average fuzzy degree of belonging of the samples to the clusters using the partition matrix: the larger the value, the better the partition [14]. The Partition Entropy (PE) measures the average entropy of the fuzzy belonging of a sample to a cluster: the lower the value, the better the partition [15]. MPE and MPC are modifications of PE and PC that take into account the number of centroids and the number of samples; MPE should be minimized, while MPC should be maximized [16].

Figure 3 shows the values of the four metrics of the fuzzy clustering process when the input data are embedded with the triplet-loss network. The PE and MPE values, which coincide in all the experiments, are lower for a reduced number of centroids, and the entropy values increase with the dimension of the embeddings. These metrics advise using few clusters and a reduced embedding size. The values of PC and MPC are coherent and tend to decrease with a larger number of clusters; when the embedding size is larger, the values are even lower. Also for these metrics, the best results are obtained with lower dimensions and few clusters.

Figure 3: Metrics plots for fuzzy clustering, one panel per embedding dimension (from 2 to 256). PE is plotted with orange triangles, MPE with red diamonds, PC with blue dots and MPC with green stars. In all the plots, MPE is superimposed on PE.
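As an illustration of how these validity indices can be computed, the following sketch uses the scikit-fuzzy library (our choice; the paper does not name its implementation) to run fuzzy C-means on a set of embeddings and derive PC, PE and MPC. The MPC formula is Dave's common definition, an assumption on our part, and `embeddings` is a random placeholder array.

```python
import numpy as np
import skfuzzy as fuzz

# Placeholder for the learned embeddings: N samples of dimension emb_dim.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 8))

n_clusters = 2  # benign vs malignant
# scikit-fuzzy expects data of shape (features, samples); the default
# metric is the Euclidean distance, as in the paper.
cntr, u, _, _, _, _, fpc = fuzz.cluster.cmeans(
    embeddings.T, c=n_clusters, m=2.0, error=1e-6, maxiter=1000, seed=0)

pc = fpc                                  # Partition Coefficient
pe = -(u * np.log(u)).sum() / u.shape[1]  # Partition Entropy
mpc = 1.0 - (n_clusters / (n_clusters - 1)) * (1.0 - pc)  # MPC (assumed form)

# Samples whose memberships all fall in [0.45, 0.55], i.e. the highly
# uncertain images counted in Table 2 (for two clusters).
uncertain = np.where(np.all((u >= 0.45) & (u <= 0.55), axis=0))[0]
print(f"PC={pc:.3f}  PE={pe:.3f}  MPC={mpc:.3f}  uncertain={len(uncertain)}")
```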
To evaluate the clusters obtained with the different embedding dimensions, we considered the sets of images whose membership lies in the range [0.45, 0.55], i.e. the ones with a high level of uncertainty. The number of these images is reported in Table 2 for the different embeddings (rows) and classes (columns). It is possible to notice that the number of uncertain images grows with the embedding size. Moreover, there are 8 images that belong to the uncertain sets of all the embeddings. These images are reported in Figure 4: one adenosis image, two fibroadenomas, one tubular adenoma and four ductal carcinoma images. It is worth noting that ductal carcinoma is the most common tumour; in fact, almost half of the images in the original dataset are of this kind (3451 out of 7909), while fibroadenoma is often misclassified as malignant [3].

Table 2
Distribution of uncertain images (memberships in the range [0.45, 0.55]) for different embedding dimensions (rows). The columns report the classes of images, labelled as follows: Adenosis (A), Fibroadenoma (F), Tubular Adenoma (TA), Phyllodes Tumour (PT), Ductal Carcinoma (DC), Lobular Carcinoma (LC), Mucinous Carcinoma (MC), Papillary Carcinoma (PC).

  Emb. size    A    F   TA   PT   DC   LC   MC   PC   Total
          2    1    8    2    1   14    2    5    6      39
          3   12    6   10    5    9    5    4    7      58
          4    7    4    9    5   32    4    5    6      72
          8   23   13    4    3   25    4    7   15      94
         16   12   12   17    8    9    3   10   11      82
         32   24   14    8   14   33    3   16   12     124
         64   11   23    7   17   44    5   11   10     128
        128   21   27   28   21   26    7    8    9     147
        256   35   27   35   16   30    6    9    9     167

Figure 4: The eight always-uncertain images: image (a) is an example of Adenosis, images (b) and (c) are from the Fibroadenoma class, image (d) is from the Tubular Adenoma class, and images (e) to (h) are from the Ductal Carcinoma class.

4. Conclusions

The adoption of fuzzy clustering in the medical field could be very useful for providing effective decision support systems. The outcomes of a fuzzy clustering provide continuous membership values that could be more informative and useful for a final decision, particularly in the case of histopathological image classification. In this work, we have proposed a triplet network deep metric learning model to derive effective embeddings for fuzzy clustering. The images projected in these embeddings lead to a better clustering even at a very low dimension. This is a very important result, due to the relevance of the data dimension in every supervised and unsupervised classification task (e.g. the curse of dimensionality and computational issues). In the near future, we plan to study how different loss functions used for the metric learning can affect the final classification. Moreover, we also want to investigate the sensitivity of the embeddings to the clustering distance used, with the final goal of providing a decision support system tool for physicians in the specific problem of histopathological image classification.

References

[1] N. Dimitriou, O. Arandjelović, P. D. Caie, Deep Learning for Whole Slide Image Analysis: An Overview, Frontiers in Medicine 6 (2019) 264. doi:10.3389/fmed.2019.00264.
[2] G. Campanella, M. G. Hanna, L. Geneslaw, A. Miraflor, W. K. S. et al., Clinical-grade computational pathology using weakly supervised deep learning on whole slide images, Nature Medicine 25 (2019) 1301–1309.
[3] F. A. Spanhol, L. S. Oliveira, C. Petitjean, L. Heutte, A Dataset for Breast Cancer Histopathological Image Classification, IEEE Transactions on Biomedical Engineering 63 (2016).
[4] K. Musgrave, S. Belongie, S.-N. Lim, A metric learning reality check, in: European Conference on Computer Vision, Springer, 2020, pp. 681–699.
[5] M. Kaya, H. Bilge, Deep metric learning: A survey, Symmetry 11 (2019).
[6] X. Wang, X. Han, W. Huang, D. Dong, M. R. Scott, Multi-similarity loss with general pair weighting for deep metric learning, 2020. arXiv:1904.06627.
[7] K. Musgrave, S. Belongie, S.-N. Lim, PyTorch metric learning, arXiv preprint arXiv:2008.09164 (2020).
[8] Y. Benhammou, B. Achchab, F. Herrera, S. Tabik, BreakHis based breast cancer automatic diagnosis using deep learning: Taxonomy, survey and insights, Neurocomputing 375 (2020) 9–24.
[9] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, R. Shah, Signature verification using a "siamese" time delay neural network, in: Proc. of the Int. Conf. on Neural Information Processing Systems, NIPS'93, Morgan Kaufmann Publishers Inc., 1993, pp. 737–744.
[10] E. Hoffer, N. Ailon, Deep metric learning using triplet network, in: A. Feragen, M. Pelillo, M. Loog (Eds.), Similarity-Based Pattern Recognition, Springer International Publishing, Cham, 2015, pp. 84–92.
[11] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778. doi:10.1109/CVPR.2016.90.
[12] L. McInnes, J. Healy, UMAP: Uniform manifold approximation and projection for dimension reduction, ArXiv abs/1802.03426 (2018).
[13] J. C. Bezdek, R. Ehrlich, W. Full, FCM: The fuzzy c-means clustering algorithm, Computers & Geosciences 10 (1984) 191–203. doi:10.1016/0098-3004(84)90020-7.
[14] J. C. Bezdek, Numerical taxonomy with fuzzy sets, Journal of Mathematical Biology 1 (1974) 57–71.
[15] J. C. Bezdek, Cluster validity with fuzzy sets (1973).
[16] R. N. Dave, Validating fuzzy partitions obtained through c-shells clustering, Pattern Recognition Letters 17 (1996) 613–623.