Isolation of Tumor Areas of Histological Images for Assessment of Quantitative Parameters Vassili Kovaleva, Valery Malysheva, Artem Piddubnyib, Alona Moskalenkoc, Anatolii Romaniukb* a United Institute of Informatics Problems of the National Academy of Sciences of Belarus (UIIP NAS of Belarus), Surganov str, 6, Minsk, 220012, Belarus b Sumy State University, Department of Pathology, Rymskiy-Korsakov str. 2, Sumy, 40007, Ukraine c Sumy State University, Department of Computer Science, Rymskiy-Korsakov str. 2, Sumy, 40007, Ukraine Abstract The research object are biomedical whole slide histological images of breast cancer. The aim of the work is to develop methods, algorithms and basic elements of a software for automatic search of tumor sites, adaptive assessment of the immunohistochemical markers expression and quantitative assessment of the analysis results. At this stage, we have analyzed difficulties of whole slide histological scans analysis. We have developed algorithms for background separation and color normalization. A search algorithm has been implemented for semi-automatic selection of tumor areas on whole slide histological images. Keywords whole slide images, segmentation, clustering, image processing, breast cancer 1. Introduction The analysis of whole slide images is an extremely laborious process, so the implementation of automated diagnostic algorithms in this area is relevant. However, histological whole slide images have a number of features that complicate the development of such algorithms. These features are: a high level of tissue diversity both in one image and between different images, hierarchy and a large amount of graphic information [1]. The development of an algorithm and its software implementation for the automatic selection of tumor areas in histological whole slide images is a difficult task [2]. Based on that, pre-processing of whole slide images is required. It should include normalization of the color distribution in whole slide histological images and the selection of the image area with specific order of tissue localization to reduce the operating time and to prevent the analysis of the background. A search algorithm for semi-automatic detection of tumor areas by detection of various image descriptors has been developed and implemented. Features of the whole slide histological images analysis We used whole slide images of scanned tissue samples with background illumination. Hematoxylin and eosin stained slides, as well as immunohistochemistry samples were used as markers. There are many manufacturers of histological scanners capable to make whole slide images of various modalities. Each scanner saves and compresses the image in different formats. It slows down the development of algorithms. For example, DICOM format was developed in radiology to solve this problem. The standardization of data format in the whole slide histological imaging industry is also very active now. * IDDM’2020: 3rd International Conference on Informatics & Data-Driven Medicine, November 19–21, 2020, Växjö, Sweden EMAIL: vassili.kovalev@gmail.com (A. 1); malyshevalery@gmail.com (A. 2); a.piddubny@med.sumdu.edu.ua (A. 3); a.moskalenko@cs.sumdu.edu.ua (A, 4); pathomorph@gmail.com (A, 5). ORCID: 0000-0002-8154-5875 (A. 1); 0000-0002-8737-8879 (A. 2); 0000-0002-6508-0131 (A. 3); 0000-0003-3443-3990 (A. 4); 0000-0003- 2560-1382 (A, 5). ©️ 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org) Most formats have hierarchy, so the image is stored as 2D blocks to significantly speed up access to small square areas. So image areas are stored in a lower resolution so there is no need to load all small blocks to show the entire image [3] (Figure 1). Figure 1. The structure of whole slide image in the computer memory [4] Machine learning methods have to work with images with various artifacts and color variability. Artifacts can appear both during the tissue samples processing and scanning. There could be a variability of chemical markers, tissue defects, scanner artifacts. Some artifacts are related to the scanner by itself. The scanner takes a picture at maximum magnification and then collects a whole slide image, so artifacts such as lighting and focus differences, stitching and calibration problems may appear [3] (Figure 2). In a couple with the high variability of tissue images, this complicates the application of deep learning methods images. In addition, the high resolution of whole slide images makes it impossible to apply deep learning methods "directly" for images. Thus, most existing solutions use the division of whole slide images into small areas that are used in the dataset. A short list of problems to be solved is presented below: - high images resolution; - lack of context (if the image is viewed in sections); - image artifacts; - variability of color distribution because of lighting and staining differences; - variability of tissue samples. Figure 2. Artifacts in whole slide images Normalization of whole slide images Color normalization of histological whole slide images is necessary because of different scanning conditions, such as different scanners characteristics and illumination, different amounts of chemical markers for tissue staining. The following algorithm is the main method for color normalization of whole slide images: - exclusion of areas that are not suitable for color scheme detection (for example, a background with no tissue); - change of RGB color model to another; - detection of basic vectors (their linear combination defines the color model); - the reverse transformation of these vectors into RGB color model to show the primary color combinations of image (Figure 3); - transformation of the image into this colors (Figure 4); - replacement of the primary colors with the reference ones before the start of the algorithm; - inverse conversion from the concentration values of the primary colors and RGB reference color values to RGB color model. In this work, two approaches for normalization of the color distribution were tested. Both of them used the transformation of the RGB value of the image pixels into the optical density values according to the formula (1) [4]: OD=− log 10 ( I ), (1) Pixels with too low optical density were not analyzed, as their signals were considered as the background with a white color. The first algorithm used SVD decomposition. After that vectors normalization according to the length of the vectors, angles between vectors and directions of the SVD expansion were analyzed [5]. In the second algorithm, instead of SVD decomposition, the covariance matrix of RGB image channels were used and vectors were obtained from the eigenvectors of this matrix [6]. Angles were measured as angles between defined by coordinates in the new space vectors and the resulting vectors. The 1st and 99th percentiles of the obtained vectors angles distribution in comparison to defined basis vectors were used as the main vectors. Further transformations are described in the generalized algorithm above. Thus, the color scheme of all images corresponds to the same base colors of chemical markers. Further, algorithms for image normalization were applied, in particular equalization of the image brightness histogram and deletion of the 1st and 99th percentiles of image pixel intensities to partially avoid the various image artifacts influence. These two normalization algorithms were applied to the Y channel in the YCbCr image representation, since only the image contrast is normalized, and the color scheme should remain the same. Figure 3. Colors of the whole slide images region component (hematoxylin, eosin, residual) Figure 4. Decomposition of a whole slide image area into components corresponding to chemical markers. Left to right: original image, hematoxylin component, eosin component Segmentation of a tissue sample in an image The second essential component is a separation of the tissue image area from the main background. In this work, a number of algorithms have been used to achieve this result. Considering large dimensions of image, for segmentation we used the whole-slide image layer with smallest magnificatoin (16 times lower than maximum available for the image). First, we blurred an image to increase the smoothness and stability of the final regions. To separate background we utilized S channel of the HSV image representation cause it shows the color saturation in each pixel, which is optimal for detection of mainly white or black background. The algorithm sequentially applies flood fill algorithm to saturation map of the image (S channel in HSV representation of image) starting from empty pixels of the images, considering that each pixel can belong to only one region. After that step we acquire a large number (more than several hundreds of regions). In order to merge the regions into several large ones the algorithm calculates various descriptors for the founded regions. At the moment, descriptors include image channels histograms in RGB, HSV color spaces as well as histograms of histology markers intensities obtained by applying decomposition algorithm. Instead of the descriptor itself, we used the result of its transformation by the principal component method to optimize the clustering algorithm time. This allows to reduce the number of descriptor elements to 16 and significantly speed up the calculations. Afterwards we applied the K-means algorithm to cluster regions into 3 groups. We supposed that background and tissue regions will be in separate groups due to highly different saturation values. The third groups was added to prevent mistakes due to artifacts. So the third group have to be merged with background or tissue regions group. This decision is taken based on relation between groups average saturation values. This method has a lot of parameters, which allowed us to adjust the number of areas and algorithm quality. These parameters are the size of the minimum area, connectivity, the maximum allowable difference between pixel intensities within the same region, etc. In total, a sufficiently reliable algorithm for analysis of immynohistochemistry and hematoxylin and eosin stained tissue samples was developed. The steps of the algorithm are shown in Figure 5. Figure 5. An algorithm of whole slide image: a) original image, b) found regions, c) combined regions of highlighted area with tissue sample Semi-automatic segmentation algorithm We have developed a semi-automatic algorithm for the selection of the tumor area at whole slide histological images. A search of similar areas by various descriptors and metrics was the main idea of the algorithm. We implemented the algorithm as web-service. Hence we include details of client-server communication in the algorithm description. After the image upload to the server, the image is divided into small square regions of equal dimension, which are called tiles. Every region represents a small area of the whole-slide image, and the tile is the smallest unit for the algorithm to process. Thus we create the database of whole-slide image descriptors by calculating different descriptors. Next paragraph will cover the descriptors, which were used in the algorithm because their selection has a high impact on the results of the algorithm. The process of descriptor calculation takes 5-20 minutes depending on the size of the image and the descriptor. Such duration times are large in comparison with the expected search query response time. Thus, after descriptor calculation user can select one region of rectangular shape and any size to use it as a base for the search algorithm. After receiving the selected region coordinates and required descriptor from the user, the server retrieves selected region from the whole-slide image and calculates its specified descriptor. Using that descriptor the server can find distance metric from the selected region descriptor to descriptors of each tile. Distance metric can be L1 metric, L2 metric or correlation metric. Afterwards, the server ranks tiles based on the minimum distance between descriptors and top 10 tiles are sent to the user with a distance map for the whole image (Figure 6). As a result, the user can assess the obtained results. The first descriptor was the color distribution histogram. For this one, the image RGB channels were combined into one by merging the values. The R channel value used the first 3 bits of the number, the G channel value used the next 2 bits, and the B channel value used the last 3 bits. After that, a histogram with 256 bins was built. To increase the efficiency, the size of the histogram was restricted by 256 elements. Further, an improved algorithm for descriptor calculation by the histogram was developed. It was an adaptive color histogram, which is a histogram of the colors distribution from a 256-color palette and consists of the number of elements equal to the size of the palette. Palette was prepared by color quantization through clustering of meaningful pixel colors from different whole-slide images. The last descriptor was the color co-occurrence matrix. This matrix was built also with a palette of colors. So, i and j) element of the matrix has the number of color pixels in the i-th place in the palette, which border the pixel of the color in the palette at j) matrix element in the m place. All descriptors were normalized to the resolution of the area. A search for similarities and descriptors makes it possible to build a similarity map (Figure 6) for a selected area, allows to find similar tumor areas within the same image. Figure 6. Example of a similarity map Conclusions We developed two algorithms for subtasks of the main problem. The first segmentation algorithm reliably selects tissue. The algorithm can be used to reduce the area of the image, which will be processed what results in faster work of all pipeline. In contrast to the first algorithm of segmentation, the semi-automatic algorithm allows user control of search parameters and the region of interest. Selection of the region of interest allows to select not only tissue but different types of tissue and highlight them on the images. For example, selection of a small region of interest with normal tissue will result in more intense highlight for normal tissue than malignant one, what can be noticed easily on the heatmap. We analyzed the features and problems of whole slide image processing and analysis. Two algorithms have been proposed to solve some problematic aspects. The search algorithm for identification of the chemical markers color allows the normalization of the color space of a whole slide histological image. The whole slide tissue segmentation algorithm reduces the area for processing and reduces the computation time. An algorithm for semi-automatic tumor areas search by the build of a similarity heat map of the selected region to the rest of the image regions was developed. Acknowledgements This research has been performed with the financial support of joint UKRAINE-BELARUS R&D project «Development of an automated program for differential diagnosis of breast tumors with a morphometric evaluation of the receptor status of cancer cells» in 2019-2020. The authors declare no conflict of interest. References 1. V. Kovalev, Y. Diachenko, V. Malyshev et al. Comparative features of open source software products for the development of an automated breast cancer diagnostic program. EUMJ (2019) 377-385. - DOI: https://doi.org/10.21272/eumj.2019;7(4):377-385. 2. V. Gargin, R. Radutny, G. Titova, et al. “Application of the computer vision system for evaluation of pathomorphological images”, in: 2020 IEEE 40th International Conference on Electronics and Nanotechnology, ELNANO 2020 - Proceedings, pp 469-473. doi:10.1109/ELNANO50318.2020.9088898. 3. N. Dimitriou, O. Arandjelovic, P.D. Caie. Deep Learning for Whole Slide Image Analysis: An Overview. Frontiers of Medicine (2019) 6:264. doi: 10.3389/fmed.2019.00264 4. DICOM Whole Slide Imaging. – URL: http://dicom.nema.org/Dicom/DICOMWSI/. 5. M. Macenko et al. “A method for normalizing histology slides for quantitative analysis”, in IEEE International Symposium on Biomedical Imaging: From Nano to Macro 2009 - Proceedings, pp 1107- 1110. 6. P. Bankhead et al. QuPath: Open source software for digital pathology image analysis. Scientific Reports (2017) 7(1) 16878.