Machine Learning for Automated Gating of Flow Cytometry Data Muhammad Suffian1,* , Sara Montagna1 , Alessandro Bogliolo1 , Claudio Ortolani1 , Stefano Papa1 and Mario D’Atri1 1 University of Urbino Carlo Bo, Urbino, Italy Abstract Manual gating is the traditional procedure adopted to identify cellular clusters from multi-dimensional datasets generated with flow cytometry, a tool for detecting and monitoring different diseases by acquiring single cell features. However, the identification of cellular subpopulations by manual gating is a time-consuming process strongly affected by human expertise. Automated analysis supported by computational systems, such as machine learning approaches, can radically change the way flow cytometry data are elaborated. In this paper we applied a suite of machine learning classifiers for analysing samples of peripheral blood acquired with flow cytometry. The goal was to identify CD4+ lymphocytes population. Four ML classifiers are examined —Support Vector Machine, Random Forest, Multilayer Perceptron and Logistic Regression using stratified 10-fold cross-validation. All the four models perform very well, with a balanced accuracy score > 0.945. We come to the conclusion that all four algorithms classify the events of interests with promising results, paving the way for further investigations. Keywords Flow Cytometry, Automated Gating, Supervised Machine Learning 1. Introduction Flow cytometry (FL) is an experimental technique that enables to measure cellular properties at a single-cell resolution by quantifying, for instance, antigens expressed on the cell surface and various physical properties [1]. From multi-dimensional datasets generated from FL, manual gating is performed to identify cellular clusters with similar properties. As such, it is adopted in detecting and monitoring different diseases, such as those of the immune system. Given the progress in the instrumentation used for cell cytometry, the number of features that can HC@AIxIA: 1st AIxIA Workshop on Artificial Intelligence For HealthCare, November 28 - December 02, 2022, University of Udine, Udine, Italy * Corresponding author. $ m.suffian@campus.uniurb.it (M. Suffian); sara.montagna@uniurb.it (S. Montagna); alessanro.bogliolo@uniurb.it (A. Bogliolo); claudio.ortolani@uniurb.it (C. Ortolani); stefano.papa@uniurb.it (S. Papa); mario.datri@uniurb.it (M. D’Atri) € https://www.uniurb.it/persone/sara-montagna (S. Montagna); https://www.uniurb.it/persone/alessandro-bogliolo (A. Bogliolo); https://www.uniurb.it/persone/claudio-ortolani (C. Ortolani); https://www.uniurb.it/persone/stefano-papa (S. Papa); https://www.uniurb.it/persone/mario-datri (M. D’Atri)  0000-0002-1946-285X (M. Suffian); 0000-0001-5390-4319 (S. Montagna); 0000-0001-6666-3315 (A. Bogliolo); 0000-0001-9291-0527 (S. Papa) © 2022 Copyright ©2022 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) be acquired is continuously increasing, making the identification of cellular subpopulations by manual gating a time-consuming process strongly affected by human expertise. The auto- mated analysis supported by computational systems, such as machine learning approaches, can radically change the way flow cytometry data are elaborated [2]. The general computational cytometry workflows can be classified into two categories based on the methods employed: discovery analysis, i.e., the detection of unknown, unique cell populations, versus focused analysis i.e., the detection of known well-defined ones. Automation can potentially lessen variability in the data analysis process in both situations. Cell populations that are neglected in successive manual gating procedures, such as cells gated out in earlier steps, can be found using automated technologies in discovery mode. When using focused analysis, the cell populations of interest are precisely specified, and the data analysis procedure adheres to a set of techniques that is likely to be validated and authorised. Automated technologies can lessen human effort by automatically classifying cases as healthy or diseased and only raising questions about certain cases for people to consider [3]. The computational approaches can be further categorised: automating the manual gating process based on rules or cell densities (flowDensity [4], OpenCyto [5]); clustering of flow cytometry data (cells, events) based on similar characteristics in high-dimensional space (Flow- SOM [6], Phenograph [7], SPADE [8]); and the supervised classification in which the data is annotated to train the learning model so that it can classify unlabelled data, i.e., cell populations or events (FlowLearn [9], ACDC [10], DeepCyTof [11]). Even though literature already reports interesting results in this field, they still are not a clinical practice. In this paper, we applied a suite of machine learning classifiers for analyzing samples of peripheral blood acquired with flow cytometry. The goal was to identify the 𝐶𝐷4+ lymphocyte population. We show the effectiveness of our approach for classifying cells with a series of tests, cross-verifying the trained models on various data files, and comparing the cell classifications with those acquired by manual gating. Four ML classifiers are examined —Support Vector Machine, Random Forest, Multilayer Perceptron and Logistic Regression —using stratified 10- fold cross-validation. All four models perform very well, with a balanced accuracy score of > 0.945. 2. Background Flow cytometry is a standard method for analysing and quantifying biological data. The capabilities of cytometry have increased, giving rise to several data-analysis methodologies [2]. The flowDensity [4] is an approach that performs a computational analysis of the flow cytometry data by automating the manual gating process based on the sequential bivariate gating method. The properties of the density distribution are used to select the ideal cut-off for each unique marker using 2D scatter plots. This method has limitations when the target is to identify unknown populations, as it looks at two dimensions simultaneously. As a result, unknown populations can be easily missed. Another method OpenCyto [5] replicates manual gating, facilitates data analysis, and provides interpretable results by incorporating domain- specific knowledge. It concentrated on finding uncommon, antigen-specific T-cell populations and discovered a novel subgroup of 𝐶𝐷8 T-cells with a vaccine-regimen-specific response that could not be found by manual analysis. Several machine learning approaches have been devised to identify new or prede- termined cellular populations and perform automated analysis of flow cytometry data [12]. These include supervised and unsupervised learning approaches. In the former, the data is annotated to train the learning model so that it can classify unlabelled data (i.e., cells, samples). The latter performs multi-channel (multivariate) analysis, i.e., grouping cells with similar characteristics through clustering analysis. Our contribution falls in the former case. The FlowSOM [6] approach used self-organizing map (an unsupervised technique for clustering and dimensionality reduction) to visualize and cluster the data from flow cytometry. It employed a substantially higher number of clusters when less number of cell types were expected. FlowSOM can be used as a starting point for analysis or as a tool to see the data after performing manual gating. The position and identification of normal mononuclear-cell subsets in viSNE displays were determined by analyzing individual peripheral blood samples that either included a neoplastic or reactive T-cell lymphocytosis alongside a cohort of 10 healthy samples [7]. A PhenoGraph and viSNE-based combined method was applied to peripheral blood mononuclear cells stained with a single 8-color T/NK cell antibody combination. The numbers of neoplastic T-cells discovered with PhenoGraph/viSNE coincided with those discovered using manual gating. Another cell density-based approach that identified a functionally different cell population without utilizing any particular underlying characteristics is called Spanning-tree Progression Analysis of Density-normalized Events (SPADE) [8]. SPADE was applied to two independent sources of cytometry data in four steps: down-sampling based on density, clustering, connecting clusters with minimal spanning tree, and upsampling to restore the cells as a final output. There are drawbacks to SPADE, such as the algorithm’s halting condition depends on the number of clusters; if the number of clusters is too low, the SPADE tree cannot accurately represent the cloud’s shape. If this value is very high, it becomes difficult to understand the SPADE tree. Similar to flowDensity, another piece of software called flowLearn uses density charac- teristics but does not require the user to adjust hyper-parameters to achieve the best results manually. Instead, it operated in a semi-supervised manner necessitating the establishment of thresholds by a human expert for gating one or a few distinctive samples. Then, these criteria are automatically applied to all data using a process known as derivative-based density alignments. It predicted gates on additional samples using a limited number of manually gated samples with density alignments. A drawback in this approach could be the gating of a limited number of samples, and density-bound rules could over-fit results. A supervised learning-based approach Automated Cell-type Discovery and Classification (ACDC) [10] automated the cell annotation by employing biological information as an input parameter. The ACDC method consists of two parts. First, a user-specified table of markers and cell labels is converted into a high-dimensional space. Second, it used random walks to execute semi-supervised classification to gather data from every point and categorize the events at the single-cell level. The ACDC has the drawback that each marker label is binary (present or absent). In contrast, intermediate markers are used to identify cell populations of interest in real life [13]. The DeepCyTOF adopts a different point of view for gating; it needs labelled cells from one sample to perform supervised calibration between a source domain distribution (reference sample) and many target domain distributions (target samples). A multi-autoencoder neural network is the foundation of the DeepCyTOF. In reality, differences across equipment were found to be relatively frequent in CyTOF investigations. These differences might be a weakness of this methodology that causes considerable batch effects in datasets with samples taken at various runs. Consequently, observable differences might be found between the data distributions of the training data (manually gated reference sample) and the remaining unlabeled test data (the remaining samples). 3. Materials and Methods This section is divided into two subsections: (i) details about the data set and pre-processing of the data, and (ii) experiments with machine learning models. 3.1. Data Set and Data Pre-Processing The data exploited in the study have been derived by the routinarian diagnostic activity per- formed by the Center of Cytometry of Urbino University. Data set were randomly selected and anonymized in order to make impossible the identification of the source. For every peripheral blood sample, data were acquired on intrinsic parameters and antigen expression displayed by white blood cells. The analyses were performed through a commercial flow cytometer and focused on the following parameters: i) Forward Scatter (FSC), ii) Side Scatter (SSC), iii) CD3 antigen expression, iv) CD4 antigen expression, v) CD8 antigen expression, vi) CD16 and CD56 combined antigen expression, and vii) C45 antigen expression. In all, 15 subjects were analyzed. For each parameter, data related to the pulse area were considered. The cytometric files produced by the analytical runs were then stripped of metadata and exported in csv format. Consequently, the dataset consists of fifteen (15) different data files, where each row is a cell, and each column contains the corresponding value of one of the 8 parameters as mentioned above (Table 1) with no missing values. Data records are labeled with binary values, i.e., gated and ungated (1 used for gated and 0 for ungated records). The extraction of gated and ungated records was possible due to the combined use of a commercial program to select the clusters of interest (Cytopaint1 ) and an unpublished program for the management of flow cytometry standard (FCS) files (Wizard) provided by one of us (MDA). In particular, in each experiment, CD4+ T cells (gated for CD4 expression) are identified by manual gating performed by experienced operators. The gating logic performed on each data file to filter out the gated records is as follows: 1. The creation of a first gate on parameters CD45+ vs SSC (denoting it as Gate-1) and selecting lymphocytes; 2. The results of Gate-1 are expressed for parameters CD3 and CD19 and a gate was traced on the events CD3+ CD19 and CD3 CD19+ (denoting it as Gate-2) 3. The results of Gate-2 are expressed for parameters CD4 and CD8, and a gate is traced on the events CD4+ CD8 (denoting it as Gate-3 which constitutes the population of T Helper 1 http://leukobyte.com/cytopaint-classic/ Table 1 Descriptions of markers used for flow cytometry data analysis. Attribute Description FSC Forward Scatter (assimilable to size). SSC Side Scatter (assimilable to granularity). CD3 Pan-T cells marker (mature T cells). CD16 + CD56 NK cells markers. CD4 Helper T cells marker. CD19 Pan-B cells marker (mature and immature B cells). CD8 Cytotoxic T cells marker. CD45 Common leukocyte antigen, hematopoietic lineage marker. Lymphocytes CD3+ CD4+ CD8-) and another gate on the events CD4 CD8+ (denoting it as Gate-4). After the hierarchical gating process, a subset of records is obtained based on CD4+ cells which will be part of class 1. The resulting dataset is unbalanced between the two classes. The percentage of CD4+ manually gated samples is presented in Table 2 against the total sample size of each data file. Due to the critical nature of medical data sets, data balancing techniques are not often recommended. Thus, we applied the k-fold stratified cross-validation technique for training and testing the ML classifiers because it maintains the same class ratio across the K folds as the ratio in the original dataset. To conclude, our approach avoids information loss in the cell gating stage by directly using the labelled flow cytometry data. Table 2 Size of data files and percentage of 𝐶𝐷4+ gated samples. File No. Data Size Percentage of CD4+ 1 30,000 8239 (27%) 2 30,000 3891 (13%) 3 30,000 2645 (8%) 4 30,000 6443 (21%) 5 30,000 3460 (12%) 6 30,000 970 (3%) 7 30,000 1609 (5%) 8 30,000 4937 (16%) 9 30,000 3618 (12%) 10 30,000 3432 (11%) 11 30,000 7141 (24%) 12 28,683 1082 (4%) 13 30,000 2839 (9%) 14 30,000 4675 (16%) 15 30,000 3913 (13%) Figure 1 illustrates the 2D representation of the dataset yielded with manifold learning technique, namely t-Distributed Stochastic Neighbor Embedding (T-SNE) [14] through which, we can see that the data samples can be discriminated easily with ML approaches, even though they are not linearly separable. Figure 1: 2D visualization of flow cytometry gated data with T-SNE. 3.2. Machine Learning Classifiers and Experimental Setup Our study compares the outcomes of allocating cell events to discrete cell populations (gated and ungated cells) using automated gates with the results from manual gates produced by expert analysis. In particular, the classification goal is the automatic identification of T Helper Lymphocytes CD4+, which constitute the gated population, against the ungated ones. We adopted supervised ML models, and trained the algorithms with gates supplied by experts. A suite of ML classifiers is employed for classification. The reason is to use various types of classifiers to observe the accuracy and over-fitting issues under the different classification mechanisms, i.e., decision tree-based, gradient-based, neural network-based and able to classify non-linearly separable data. These classifiers, with their brief descriptions, are listed below (the mathematical equations for these classifiers are omitted as these are well-understood methods): 1. Random Forest (RF)—is a meta estimator that averages the results of many decision tree classifiers by fitting different sub-samples of the dataset to increase predicted accuracy and reduce over-fitting. 2. Logistic Regression (LR)—is a parametric regression technique that involves fitting a line (or a curve) to the data and then using the gradient descent function to distinguish between different output classes. In this study, we utilized LR with gradient descent, which fits the dataset with a curve. Figure 2: Machine learning pipeline for flow cytometry data. 3. Support Vector classifier (SVC)—similar to LR, it also fits a curve on the dataset; however, the curve itself tends to maintain a maximum margin on both sides [15]. We employed a radial basis function kernel in this research to separate the data points because the dataset appears not linearly separable. 4. Multi-Layer Perceptron (MLP)—is a basic artificial neural network (ANN) type. We used MLP with one hidden layer of 100 neurons, and the other hyper-parameters are kept the same as the default values provided in the scikit-learn2 Python library. For the training and validation phase, we mainly designed two experiments (each maintains details about the sub-experiments) to examine the classification of gated and ungated samples. Exp. 1 The first experiment applies the ML classifiers listed above to each single data file, by extracting the train and test-set (80-20 split); Exp. 2 The second experiment is conducted by training the classifiers on 10 subjects randomly selected (accumulated training data) and testing the rest on the remaining 5 subjects. Figure 2 illustrates the general ML pipeline adopted for the flow cytometry data. The flow cytometry standard (FCS) data files are used for preprocessing. The processed FCS data then subjected to apply the hierarchical gating (the gating process is described in section 3.1). After the gating process, cell-annotation is performed and CD4+ T cells are separated from the rest of cell populations. The supervised machine learning pipeline is used to train and validate the models. A stratified 10-fold cross-validation (CV) technique is employed to validate the classification performance of trained models. The results for CD4+ T cells are compared with results of trained ML models to evaluate the performance of all models. The training data for each fold were obtained in equal amounts to equalize the class occurrence frequencies. These data were then used to train a model. The validation set is then used to validate the model. The experiments are conducted using a Python Collaboratory environment. Experimental results are described in Section 4. 2 https://scikit-learn.org/ 4. Results and Evaluations In this section, we present the results for the two types of experiments. Exp. 1 Results of the first experiment are reported in Table 3 for only 3 data files, since all classifiers’ results showed similar performances on the different data files, which likely means that the datasets’ distributional variations weren’t significant. The average scores for both metrics are presented. Table 3 Results of machine learning classifiers for 𝐶𝐷4+. Exp. SVC RF MLP LR File No. № BA F1 BA F1 BA F1 BA F1 1 File-1 .945 .952 .985 .998 .964 .975 .984 .975 2 File-2 .959 .962 .998 .998 .981 .998 1. 1. 3 File-3 .987 .998 .998 .998 .997 .996 .998 .998 It can be observed from Table 3 that RF outperformed other classifiers for both 𝐵𝐴 = .985 and 𝐹 1 = .998 metrics for data file-1. LR performed better for data file-2, and all the classifiers achieved similar results on data file-3 for BA and F1 (Balanced Accuracy and F1-Score). In general, all the four ML classifiers performed very well, with a balanced accuracy score of > 0.945. Exp. 2 We examined the same four classifiers in the second experiment. The training data from 10 data files are added to a new file, making a more extensive training set. The obtained training set contains the same features and data distribution with corresponding labels as the source data and is split into 10-folds. The ML classifiers are trained and validated with stratified 10-fold cross-validation. Then, the resulting trained classifiers are subjected to evaluation on the testing sets of other 5 files. The performance of ML classifiers on each file for 𝐶𝐷4+ cell classification is presented in Table 4, in which average scores for both metrics are presented. It Table 4 Results of machine learning classifiers for 𝐶𝐷4+ event on accumulated data. Exp. SVC RF MLP LR File No. № BA F1 BA F1 BA F1 BA F1 1 File-11 .975 .985 .985 .998 .964 .975 .984 .985 2 File-12 .945 .945 .998 .998 .925 .918 .998 .998 3 File-13 .967 .988 .988 .998 .997 .998 .998 .998 4 File-14 .985 .998 .978 .988 .965 .968 .952 .965 5 File-15 .997 .998 .988 .998 .897 .896 .968 .958 can be observed from Table 4 that SVC and RF maintained their performance the same as for the first experiment. The MLP and LR have shown a slight downfall in results. The least score of BA for MLP on file-12 and file-15 was recorded at .925 and .897, respectively. The least score of BA for LR on file-14 and file-15 was recorded at .952 and .968, respectively. Generally, the performance for all classifiers is promising, with a score of BA > .897. 5. Conclusion and Discussion The field of flow cytometry witnesses a significative progress in blood samples analysis that brought the acquisition of a huge amount of data. Given these premises, automatic data analysis, carried out using supervised learning techniques that automatically categorize samples according to clinical protocols, can provide enormous benefits. Such analysis is possible through automated methods without human subjectivity and gating bias. We conducted two experiments to automate the manual gating procedure for classifying CD4+ T cells from flow cytometry data. Four ML supervised algorithms have been trained with samples manually gated by experts in the pre-processing phase. The current study demonstrates our method’s capacity to distinguish the T Helper Lymphocytes CD4+ among all the types of cells present in the dataset, with high precision in terms of balanced accuracy and f1-score. This result suggests that, with training data available as gated examples, supervised classification offers an effective technique for automatic analysis of flow cytometry data, enabling to extract and compute the size of different cellular populations. Despite its apparent simplicity, this approach is of particular importance, as it constitutes a replicable mechanism in a series of increasingly complex contexts, which can be exploited for the realization of algorithms aimed at the automatic diagnosis of haemato-oncological pathologies with characteristic phenotypes. As the accuracy of all the models is very high, the future work is to explore the importance of the features and evaluate if some straightforward relationship between features and output is present. 6. Acknowledgment This work is a part of a collaboration project to automate the gating process of clinical flow cytometry at University of Urbino. References [1] C. P. Verschoor, A. Lelic, J. L. Bramson, D. M. Bowdish, An introduction to automated flow cytometry gating tools and their implementation, Frontiers in immunology 6 (2015) 380. [2] S. Montante, R. R. Brinkman, Flow cytometry data analysis: Recent tools and algorithms, International Journal of Laboratory Hematology 41 (2019) 56–62. [3] M. Cheung, J. J. Campbell, L. Whitby, R. J. Thomas, J. Braybrook, J. Petzing, Current trends in flow cytometry automated data analysis software, Cytometry Part A 99 (2021) 1007–1021. [4] M. Malek, M. J. Taghiyar, L. Chong, G. Finak, R. Gottardo, R. R. Brinkman, flowden- sity: reproducing manual gating of flow cytometry data by automated density-based cell population identification, Bioinformatics 31 (2015) 606–607. [5] G. Finak, J. Frelinger, W. Jiang, E. W. Newell, J. Ramey, M. M. Davis, S. A. Kalams, S. C. De Rosa, R. Gottardo, Opencyto: an open source infrastructure for scalable, robust, reproducible, and automated, end-to-end flow cytometry data analysis, PLoS computational biology 10 (2014) e1003806. [6] S. Van Gassen, B. Callebaut, M. J. Van Helden, B. N. Lambrecht, P. Demeester, T. Dhaene, Y. Saeys, Flowsom: Using self-organizing maps for visualization and interpretation of cytometry data, Cytometry Part A 87 (2015) 636–645. [7] J. A. DiGiuseppe, J. L. Cardinali, W. N. Rezuke, D. Pe’er, Phenograph and visne facilitate the identification of abnormal t-cell populations in routine clinical flow cytometric data, Cytometry Part B: Clinical Cytometry 94 (2018) 744–757. [8] P. Qiu, E. F. Simonds, S. C. Bendall, K. D. Gibbs, R. V. Bruggner, M. D. Linderman, K. Sachs, G. P. Nolan, S. K. Plevritis, Extracting a cellular hierarchy from high-dimensional cytometry data with spade, Nature biotechnology 29 (2011) 886–891. [9] M. Lux, R. R. Brinkman, C. Chauve, A. Laing, A. Lorenc, L. Abeler-Dörner, B. Hammer, flowlearn: fast and precise identification and quality checking of cell populations in flow cytometry, Bioinformatics 34 (2018) 2245–2253. [10] H.-C. Lee, R. Kosoy, C. E. Becker, J. T. Dudley, B. A. Kidd, Automated cell type discovery and classification through knowledge transfer, Bioinformatics 33 (2017) 1689–1695. [11] H. Li, U. Shaham, K. P. Stanton, Y. Yao, R. R. Montgomery, Y. Kluger, Gating mass cytometry data by deep learning, Bioinformatics 33 (2017) 3423–3430. doi:10.1093/ bioinformatics/btx448. [12] B. S. . B. A. J. Hu, Z., Application of machine learning for cytometry data, Frontiers in im- munology, 12, 787574. (2022). doi:https://doi.org/10.3389/fimmu.2021.787574. [13] J. H. Levine, E. F. Simonds, S. C. Bendall, K. L. Davis, D. A. El-ad, M. D. Tadmor, O. Litvin, H. G. Fienberg, A. Jager, E. R. Zunder, et al., Data-driven phenotypic dissection of aml reveals progenitor-like cells that correlate with prognosis, Cell 162 (2015) 184–197. [14] A. C. Belkina, C. O. Ciccolella, R. Anno, R. Halpert, J. Spidlen, J. E. Snyder-Cappione, Automated optimized parameters for t-distributed stochastic neighbor embedding improve visualization and analysis of large datasets, Nature communications 10 (2019) 1–12. [15] B. E. Boser, I. M. Guyon, V. N. Vapnik, A training algorithm for optimal margin classifiers, in: Proceedings of the fifth annual workshop on Computational learning theory, 1992, pp. 144–152.