Modeling of Small Data with Unsupervised Generative Ensemble Learning

Serge Dolgikh a

a National Aviation University, 1 Lubomyra Huzara Ave, Kyiv, 03058, Ukraine

Abstract
Modeling and analysis of small data raise essential problems and challenges that stem from insufficient sampling of unknown observable distributions, which complicates the analysis and often reduces the statistical confidence of the conclusions. In this work, an original approach to the analysis of small data is proposed, based on an ensemble of generative neural network models, with the intent of identifying stable clusters of data in informative generative representations. We demonstrate how the characteristic structure of stable clusters in a dataset of images of basic geometric shapes can be determined from the representations produced by a generative ensemble. The method can be used to identify characteristic structure, perform correlation analysis and augment data of different types and, under the conditions discussed in the paper, improve the performance of supervised classification in cases with a deficit of training data.

Keywords
Unsupervised learning, ensemble learning, clustering, statistical analysis, small data

1. Introduction
Modeling and analysis of small data raise essential problems and challenges that stem from insufficient sampling of unknown observable distributions, which complicates the analysis and often lowers the statistical confidence of the conclusions. Nevertheless, early analysis of structure and trends in emerging data can be essential in novel or rare situations and events, where large volumes of confident decisions may not be available [1]. Among the recognized challenges in practical applications of machine intelligence methods to the analysis of small data are the stability of learning and of the produced results.
This instability can be observed, for example, as a strong dependency of learning success on the choice of training parameters, the selection and temporal ordering of batches, and other training factors. Issues that have been noted [2,3] include reproducibility of the results, overfitting and inability to generalize. As a consequence of these challenges, results produced by models of similar architecture with the same datasets can be inconsistent and volatile, and the ability to generalize characteristic patterns can be more limited than in conventional applications. Not least, reproducibility of the results, which is essential in establishing confidence in the methods, can be less certain, significantly complicating the comparison of methods and models.

Numerous studies have examined the problem of stability of learning with small data, and a number of promising approaches and directions have been described, including cross-validation, ensemble methods [4], Radial-Basis Function (RBF) networks [5,6] and other methods [7,8]. However, whereas some of these methods have shown success in a number of specific applications, their generality and applicability to different types of data and problems could not, to the best of our knowledge, be established, owing to their specialized structure and architecture and to essential assumptions about the distributions.

IDDM-2022: 5th International Conference on Informatics & Data-Driven Medicine, November 18–20, 2022, Lyon, France
EMAIL: sdolgikh@nau.edu.ua (A. 1)
ORCID: 0000-0001-5929-8954
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

In parallel to these developments, methods of unsupervised generative learning [9,10] have demonstrated the ability to significantly simplify complex data by reducing redundancy in the observable parameters and identifying (extracting) informative features. In a growing number of instances, these methods were instrumental in the analysis of patterns in complex real-world data [11,12], including data strongly constrained by the size of the sample [13]. Critically for success in the identified problem area, the application of such methods is not limited by the availability of prior data, including labeled datasets, and in many cases can be successful with smaller samples than conventional methods of supervised learning. These traits make methods of unsupervised generative learning good candidates for the analysis of data constrained by both size and availability of confident prior knowledge, without precluding aggregation of confidently known data for subsequent analysis with conventional methods.

To address the challenges outlined above, we propose an ensemble approach [14] based on a collective of unsupervised generative neural networks that identifies stable patterns and structure in the observable (training) data and simultaneously addresses both the deficit of known labels and the stability of learning. Stable structures in the informative low-dimensional representations of observable data produced by generative models can be identified via a process that was developed and used for several purposes, including correlation analysis by factors of interest and augmentation of small data by generating new data points from the identified characteristic latent structure.
In contrast to some of the methods mentioned earlier, this approach does not depend strongly on specific assumptions about the observable distributions and can be used in a generic manner with data of different types and origins.

2. Methods
An ensemble of generative neural network models, each a deep convolutional autoencoder with a strongly compressed representation layer, was used to produce two- or three-dimensional representations of small datasets of images of basic geometrical shapes. The advantage of this architecture stems from its previous applications in producing informative low-dimensional representations of complex data, as well as from the universal approximation capacity of neural network models [15], which makes them a useful and versatile tool for modeling diverse distributions of complex data.

Following successful generative training of the models in the ensemble, embedded distributions of the evaluated datasets were produced in the spaces of latent coordinates. We then attempted to identify stable populations of recognizable latent structures, such as visually identifiable clusters. The resulting structure of stable clusters had to be invariant with respect to the individual trained model and therefore represent innate characteristics of the input data; this could be verified because the composition of the dataset was known.

2.1. Generative Neural Network Architecture
The deep convolutional autoencoder neural network [16,17] had an input layer of dimension p followed by 2–3 convolutional layers, as is common in the practice of learning visual data. The models had convolutional layers for the acquisition of visual features, one deep layer and a central encoding layer of size d, creating two- or three-dimensional (i.e., d = 2, 3) latent representations of the input data defined by the activation values of the latent neurons in the encoding layer.
Overall, the generative autoencoder models in this work had 48,000–96,000 trainable parameters, depending on the configuration of layers. The decoding, or generating, stage was fully symmetrical to the encoder. The models were implemented in TensorFlow / Keras [18], with Python packages for data processing, plotting and visualization used in the analysis of the results. An architecture diagram of the model is shown in Figure 1.

Figure 1: Deep convolutional autoencoder architecture with a strong dimensionality reduction.

The models were trained in an unsupervised process by minimizing the generative error (i.e., the distance of the generated output from the input on the training batches), with MSE (mean squared error) and CCE (categorical cross-entropy) cost functions. Training over 10–25 epochs produced a strong reduction in the value of the cost function for the majority of models in the ensemble. The success of generative learning was measured by two criteria: 1) the drop in the value of the cost function on the validation dataset; 2) the ability of the trained models to generate a randomly selected subset of the training data (Figure 2). Up to 80–90% of the models trained successfully by these criteria.

Figure 2: Generative training with geometrical shapes dataset: evaluation of training success.

2.2. Data
For verification of the approach, we used a dataset of greyscale images of basic geometrical shapes, including circles, triangles and empty backgrounds, with variation in size and contrast. The composition of the dataset is described in detail in [19]. Two small datasets of different sizes were used: G-150, with 50 samples per shape and an overall size of 150; and G-300, with 100 samples per shape. Images of different geometrical shapes represented different characteristic patterns in small datasets of observable data. The dataset of geometrical shapes is described in Table 1.
Table 1
Main characteristics, Geometric shapes dataset

Dataset   Size   Input size           Composition                              Variation parameters
G-150     150    32 x 32 greyscale    3 shapes: circle, triangle, background   size, contrast
G-300     300    32 x 32              3 shapes                                 size, contrast

2.3. Unsupervised Ensemble Learning
We approached the question of stability of learning with small data using a set (an ensemble) of generative neural network models that do not require prior knowledge, including labeled data, for successful training. An ensemble of n trained generative models thus produced an array of two-dimensional representations of the input data, as shown in Figure 3.

Figure 3: 2D latent distributions, G-150 dataset, three independently trained models.

This phase of unsupervised generative ensemble learning produced a set of pairs: R = { (trained model, map(input data point, 2D latent position)) × n }.

Associating a unique id with the latent position of an input data point (Figure 2, e1) relative to the positions of the other points in the set made it possible to identify stable clusters K in the input data by an entirely unsupervised process, as follows (pseudocode, where D is the input dataset; E is the encoding phase of the generative model; e(x) = E(x) is the encoded (latent) image of an observable data point; m is the number of identified clusters):

for x(k) in D:
    e_k = E(x(k))
    if e_k lies in the cluster K_l of an earlier element e_l:
        K(x) = K_l              (known cluster)
    else if conf(e_k, L) > γ:
        K(x) = L; m := m + 1    (a new cluster)
    else:
        K(x) = A                (not in a cluster)

where A is an arbitrary id for elements with an uncertain cluster, and γ is the confidence threshold for identifying the latent position e_k as belonging to cluster L.

The process can be described as follows: if the next element is in the same cluster as an earlier one (i.e., one with a lower sequential id in the dataset), the cluster of that element is assigned.
If the cluster of the element is uncertain, an arbitrary constant number is assigned; finally, if neither of the conditions is satisfied, a new cluster is assigned, and the process is repeated until the dataset is exhausted. This process is deterministic and, as can be seen, does not depend on the choice of ordering sequence, as long as the same sequence is maintained throughout the process.

The result is a matrix K(D) of (data point, cluster id) pairs of dimension (M × n), where M is the size of the dataset and n is the size of the ensemble of generative models, and where points in the same cluster (including uncertain cluster association) have the same cluster id. In the final step, one can obtain the set of stable clusters K_s(D) identified by the ensemble as the subset of K(D) satisfying a confidence criterion c_s:

\[ K_s(D) = \{\, K(x) \in K(D) : \mathrm{corr}(K(x)) \ge c_s \,\} \tag{1} \]

For example, if the correlation of K(x) in the matrix K(D) was found to be 0.9 (i.e., 9 out of 10 models in the ensemble produced the same cluster id for a given element) and the size of the ensemble is 20, then the 95% confidence interval for the correlation coefficient of the element and the cluster would be [0.76, 0.96], indicating a strong and confident association of the element with the cluster.

The resulting subset of stable clusters K_s(D) can be used in the analysis of the composition of the input data and in a number of other applications, as discussed in the subsequent sections. It is worth reiterating that at no point in the analysis were any known samples of the classes in the input dataset used.

3. Results

3.1. Cluster Structure
The results of the evaluation of the cluster structure with the datasets of images, obtained by the process described in the preceding section, are presented in Table 2 and Table 3. Clusters were identified by visual analysis (examples in Figure 3); the results demonstrate both the stability and the accuracy of the identified cluster structure for the data in the study.
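Returning briefly to the stable-cluster selection of Section 2.3, the step can be illustrated with a minimal Python sketch. It is not the implementation used in the study: the function names (`canonical_ids`, `stable_assignments`, `fisher_ci`), the marker `-1` for uncertain elements, and the use of the agreement fraction across models as the confidence measure are assumptions made for illustration. The interval function assumes the Fisher z-transformation, which the text does not name but which reproduces the interval quoted for a correlation of 0.9 and an ensemble of 20 models.

```python
import math
from collections import Counter

UNCERTAIN = -1  # assumed marker for elements with no confident cluster (the id A in the text)


def canonical_ids(labels):
    """Renumber one model's cluster labels by order of first appearance in the dataset."""
    mapping = {}
    out = []
    for lab in labels:
        if lab == UNCERTAIN:
            out.append(UNCERTAIN)
            continue
        if lab not in mapping:
            mapping[lab] = len(mapping)  # next unused cluster id
        out.append(mapping[lab])
    return out


def stable_assignments(K, c_s):
    """K holds n per-model label lists of length M (the transpose of the M x n matrix).

    Returns {element index: (cluster id, agreement)} for elements whose most
    common cluster id is shared by at least a fraction c_s of the n models.
    """
    n = len(K)
    stable = {}
    for i in range(len(K[0])):
        votes = Counter(row[i] for row in K if row[i] != UNCERTAIN)
        if not votes:
            continue
        cluster_id, count = votes.most_common(1)[0]
        if count / n >= c_s:
            stable[i] = (cluster_id, count / n)
    return stable


def fisher_ci(r, n, z_crit=1.959964):
    """Approximate 95% interval for a correlation r from n observations (Fisher z)."""
    z = math.atanh(r)
    half_width = z_crit / math.sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)


# Toy ensemble of three models over six data points (hypothetical labels).
raw = [
    ["a", "a", "b", "b", "c", "c"],
    ["x", "x", "y", "y", "z", "y"],
    [7, 7, 3, 3, 5, 5],
]
K = [canonical_ids(row) for row in raw]
stable = stable_assignments(K, c_s=0.9)
print(sorted(stable))  # element 5 is dropped: only 2 of 3 models agree on it

low, high = fisher_ci(0.9, 20)
print(round(low, 2), round(high, 2))  # 0.76 0.96
```

Numbering clusters by first appearance keeps ids comparable across models only when the early elements are assigned consistently; an uncertain or misplaced early element would shift the numbering, so a practical version would likely need a more robust alignment of cluster ids.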
In the future, unsupervised clustering methods such as DBSCAN [20], Mean Shift and others can be applied to identify the stable cluster structure in an automated unsupervised process.

Table 2
Cluster composition, G-150 and G-300 datasets, n(1) = 20

Dataset   Number of clusters   Clustered fraction(2) at conf = 0.8   Clustered fraction at conf = 0.95   Visible cluster separation
G-150     3                    1.0                                   0.92                                high
G-300     3                    1.0                                   0.96                                high, very high

(1) Ensemble size
(2) Fraction of the dataset in identified stable clusters, at the given confidence level

Table 3
Cluster confusion matrix, G-150 dataset

Cluster, type     Cluster 1   Cluster 2   Cluster 3
0 (circle)        1.0         0.          0.
1 (triangle)      0.25        0.75        0.
2 (background)    0.          0.15        0.85

With a larger unlabeled dataset (G-300 compared to G-150), the accuracy of the cluster-to-type association improved significantly, rising to 95–100%. Stability of the latent structure of clusters is a key observation and a necessary requirement for successful generation of new data; it confirms that the clusters identified by the unsupervised generative ensemble method indeed describe stable characteristic patterns in the observable data.

3.2. Generation and Prototypes
The architecture of generative models provides a direct way to propagate positions in the latent space to the space of input (observable) parameters. The mapping is obtained by taking a latent position with coordinates p = (l1, l2) as input to the generator component of the model (Figure 1) to produce an observable position X_obs:

\[ X_{obs} = G(p), \quad p = (l_1, l_2) \tag{2} \]

where G: R → O is the generator component of the model, operating from the latent space R into the space of inputs O. Based on (2), the generative ability of successfully trained models can be used to create new data points with "similarity" to the identified characteristic patterns by selecting positions in the latent regions of the identified stable clusters, as illustrated in Figure 4.

Figure 4: Cluster-based data generation.
Green, red dots: stable latent clusters; cross: newly generated data points; blue: other data points.

The effectiveness of the proposed method of ensemble cluster-based data augmentation can be supported by the following argument. Consider a small dataset S of size N with p input parameters. With a conventional method of approximation, e.g. Gaussian, the error of the mean in each of the parameters of the data can be estimated as p_mean / √N [21]. Where N is small, the dispersion of the observable parameters can be substantial. Next, if a successful generative representation of a lower dimensionality d with a good cluster structure exists, the data can be approximated with good accuracy by a quasi-multimodal distribution with N_c modes, where N_c is the number of stable latent clusters, and with dispersion d_mean / √n_clus, where n_clus is the size of a latent cluster. Where d is small, d ≪ p (i.e., a strong reduction of the dimension of the observable data), and the number of samples in the principal clusters is sufficiently large, one obtains a statistical problem of significantly lower complexity and dispersion. As an example, in this study a reduction of dimensionality from 4,096 input parameters (grayscale images with resolution 64 × 64) to 2, i.e. by a factor of ~2,000, was achieved.

4. Applications

4.1. Data and Factor Analysis
Decomposition of unlabeled datasets into a structure of stable latent clusters, where successful, can be helpful in the analysis of the distribution of the data and of associated factors of interest. As discussed earlier, the proposed approach offers a general capability, independent of the specific type of data, to decompose observable data into a structure of more homogeneous regions, or clusters. Even without known samples of the classes of interest, distributions of clusters, both latent and observable, can be analyzed in detail, including observation of characteristic representatives of clusters, or prototypes.
The generative ability of the models in the study, combined with cluster decomposition of informative representations, allows non-trivial analysis of the composition of the input data by generating typical observable instances of stable clusters, or prototypes of characteristic natural classes of data. With stable clusters identified by the proposed method, one can generate specific latent positions associated with the clusters, for example as the mean of the cluster member positions, and propagate them to the observable space with the generative transformation (2):

\[ P(K) = G\Big( \frac{1}{|K|} \sum_{x \in K} E(x) \Big) \tag{3} \]

where K is a stable latent cluster and P(K) is its observable prototype. Examples of observable prototypes of clusters in the dataset G-150 are shown in Figure 5.

Figure 5: Cluster prototypes, G-150 dataset.

Importantly, cluster decomposition can provide insights into essential associations of the data with the factors of interest, again without any previously known context or associations. As an example, consider a hypothetical case in which the input data points {x} of the dataset G-150 represent patient data and are associated with a certain factor of interest f(x), such as a reaction to an infection. A direct correlation analysis in the observable space can be challenging due to the large number of observable parameters (close to 5,000). On the other hand, cluster decomposition as discussed earlier can provide a mapping of each input data point to its stable cluster, x → K(x), resulting in a single-dimension correlation problem (f(x), K(x)), a massive reduction of complexity.

4.2. Data Generation and Augmentation
Based on the discussion in Section 3.2, cluster decomposition can be used to generate new data points and augment small datasets, again without requiring prior knowledge of the distribution of the input data. Once a structure of stable latent clusters has been identified, it is straightforward to determine their distribution regions and produce latent candidate positions for augmentation of the original data.
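The prototype construction of Section 4.1 (the mean latent position of a cluster propagated through the generator) and the production of latent candidate positions can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the study: candidates are drawn from a Gaussian around the cluster centroid scaled by the per-axis spread, which is only one possible way to stay within a cluster's region, and `toy_decoder` is a hypothetical stand-in for the generator component G of a trained model.

```python
import random
from statistics import mean, pstdev


def cluster_centroid(points):
    """Mean latent position of the members of one stable cluster."""
    return tuple(mean(axis) for axis in zip(*points))


def sample_candidates(points, k, scale=0.5, seed=0):
    """Draw k latent positions around the centroid (assumed Gaussian sampling).

    The spread along each latent axis is scale times the population standard
    deviation of the cluster members along that axis.
    """
    rng = random.Random(seed)
    center = cluster_centroid(points)
    spread = [pstdev(axis) for axis in zip(*points)]
    return [
        tuple(rng.gauss(m, scale * s) for m, s in zip(center, spread))
        for _ in range(k)
    ]


def generate(decoder, latent_positions):
    """Apply the generator to each latent position, as in transformation (2)."""
    return [decoder(p) for p in latent_positions]


def toy_decoder(p):
    """Hypothetical generator: maps a 2D latent position to a 2-element output."""
    l1, l2 = p
    return [l1 + l2, l1 - l2]


# Latent positions of one stable cluster (made-up values).
cluster = [(1.0, 2.0), (3.0, 4.0), (2.0, 3.0)]
centroid = cluster_centroid(cluster)
prototype = generate(toy_decoder, [centroid])[0]   # observable prototype of the cluster
candidates = sample_candidates(cluster, k=5)       # latent positions for augmentation
augmented = generate(toy_decoder, candidates)      # new observable data points
print(centroid, prototype, len(augmented))
```

With a trained autoencoder, `toy_decoder` would be replaced by a call to the decoder part of the model; the fixed seed only makes the sketch reproducible.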
The generative transformation (2) can then be applied to obtain the corresponding data points in the space of observable parameters.

To summarize the results and discussion on data generation, ensemble-based generative augmentation of small data can be successful under these conditions:
• The latent dimensionality is sufficiently small: d ≪ p.
• The models demonstrate good learning success and a consistent, stable cluster structure.
• The number of stable clusters is small compared to the size of the dataset (the number of samples), and the population of the main clusters is sufficiently large: N / n_clus ≫ 1.

If the conditions are satisfied, the original data can be described by a multi-modal distribution of stable latent clusters that can be identified with density clustering or another method, as demonstrated earlier. Further, augmentation of the data can be performed in an unsupervised process based on the identified distribution in the latent space, and will provide stable results that are invariant with respect to the selection of a specific instance of the generative model.

4.3. Classification
The method of augmentation based on the unsupervised cluster structure produced in generative learning can be employed to improve the success of classification by supervised models trained with small datasets. As in the conventional practice of supervised learning, the size and representative quality of the training data can have a strong influence on the accuracy of classification [22]. The success of the method depends essentially on the presence of a correlation between the stable latent structure of clusters and the factor of interest for classification, which can be used as a label in supervised learning.
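Whether such a correlation is present can be checked numerically before any augmentation is attempted. The sketch below is one possible way to do it, not a method taken from the study: it measures the association between cluster ids K(x) and a categorical factor f(x) with Cramér's V (a standard chi-squared based statistic, chosen here as an assumption), and derives a majority-vote label per cluster that could later be attached to generated points. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def cramers_v(cluster_ids, factor):
    """Cramér's V between cluster ids and a categorical factor (0 = none, 1 = perfect)."""
    n = len(cluster_ids)
    table = Counter(zip(cluster_ids, factor))   # observed counts per (cluster, value)
    rows = Counter(cluster_ids)
    cols = Counter(factor)
    chi2 = 0.0
    for k in rows:
        for f in cols:
            expected = rows[k] * cols[f] / n
            chi2 += (table[(k, f)] - expected) ** 2 / expected
    dof = min(len(rows), len(cols)) - 1
    return math.sqrt(chi2 / (n * dof)) if dof > 0 else 0.0


def cluster_labels(cluster_ids, factor):
    """Most frequent factor value in each cluster (majority vote)."""
    by_cluster = {}
    for k, f in zip(cluster_ids, factor):
        by_cluster.setdefault(k, Counter())[f] += 1
    return {k: counts.most_common(1)[0][0] for k, counts in by_cluster.items()}


# Toy data: stable cluster ids and a known factor for six data points.
clusters = [0, 0, 1, 1, 2, 2]
factor = ["neg", "neg", "pos", "pos", "pos", "pos"]

v = cramers_v(clusters, factor)
labels = cluster_labels(clusters, factor)
print(round(v, 3), labels)
```

A value of V close to 1 suggests that the cluster id carries most of the information about the factor, which is the situation in which cluster-based labels for generated points are defensible; values near 0 suggest that augmentation along these clusters is unlikely to help classification.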
If such a correlation can be established between the data points in the identified stable clusters and the factor of interest, as discussed in Section 4.1, the augmentation process outlined in Sections 3.2 and 4.2 can be applied, with class labels assigned to the generated data points based on the established association between latent clusters and known classes. Such a process of augmentation can improve classification accuracy owing to a larger and more representative dataset for supervised learning.

For example, a clustering analysis performed with the ECE dataset [13] demonstrated a strong correlation of the identified cluster structure with the classification factor of interest, an epidemiological outcome. In that and similar cases, where a correlation of the identified cluster structure with an observable factor of interest can be established, augmentation of data with the method described in this work can produce substantial improvements in classification.

5. Conclusions
A method for identifying unsupervised cluster structure in generative representations of small datasets with an ensemble of unsupervised generative models has been described and verified with small datasets of visual data of basic geometrical shapes. It was shown that a stable structure of clusters can be identified in the latent representations of successful generative models with high confidence, in the interval 90–100% for the dataset used in the study. The structure of stable clusters representing characteristic types in the input data can be used to augment small datasets by generating new data points, with several potential applications, including enhancement of labeled datasets with the objective of improving the success of classification in supervised learning.
It should be noted that the method described in this work may not be universally applicable to all datasets, and that its effectiveness is defined not only by the observable parameters but also by the composition and characteristics of the dataset, such as its size, the number and population of the principal clusters in the latent representation of the data, and their correlation with the factors of interest. Where the conditions described in Sections 3.2 and 4.2 are met, augmentation can produce additional data points associated with the principal clusters and improve the performance of conventional classification methods trained with the augmented datasets.

In novel, rare or non-standard cases, events and environments, large volumes of known data may be needed for confident analysis with conventional methods of machine learning, and the availability of training data may present strong challenges. Methods of unsupervised generative learning, including the ensemble approach presented in this work, can be successful in identifying characteristic structure and patterns even with smaller sets of observable data and without requiring massive prior knowledge, offering a practical direction toward improving the confidence of the analysis.

6. References
[1] Hekler, E.B., Klasnja, P., Chevance, G. et al.: Why we need a small data paradigm. BMC Medicine 17(1), 133 (2019).
[2] Wasserman, P.D.: Neural computing: theory and practice. Van Nostrand-Reinhold, New York (1989).
[3] LeBaron, B., Weigend, A.S.: A bootstrap evaluation of the effect of data splitting on financial time series. IEEE Trans. Neural Networks 9, 213–220 (1998).
[4] Cunningham, P., Carney, J., Jacob, S.: Stability problems with artificial neural networks and the ensemble solution. Artificial Intelligence in Medicine 20(3), 217–255 (2000).
[5] Karar, M.E.: Robust RBF neural network-based backstepping controller for implantable cardiac pacemakers. Int. J. Adap. Cont. Sign. Proc. 32, 1040–1051 (2018).
[6] Izonin, I., Tkachenko, R., Dronyuk, I. et al.: Predictive modeling based on small data in clinical medicine: RBF-based additive input-doubling method. Math. Biosc. Eng. 18(3), 2599–2613 (2021).
[7] Forman, G., Cohen, I.: Learning from little: comparison of classifiers given little training. In: Proceedings of PKDD, 19, 161–172 (2004).
[8] Geris, L.: Computational modeling in tissue engineering. Springer-Verlag, Berlin (2013).
[9] Bengio, Y.: Learning deep architectures for AI. Foundations and Trends in Machine Learning 2(1), 1–127 (2009).
[10] Coates, A., Lee, H., Ng, A.Y.: An analysis of single-layer networks in unsupervised feature learning. In: Proceedings of 14th International Conference on Artificial Intelligence and Statistics 15, 215–223 (2011).
[11] Gondara, L.: Medical image denoising using convolutional denoising autoencoders. In: 16th IEEE International Conference on Data Mining Workshops (ICDMW), Barcelona, Spain, 241–246 (2016).
[12] Shi, J., Xu, J., Yao, Y., Xu, B.: Concept learning through deep reinforcement learning with memory augmented neural networks. Neural Networks 110, 47–54 (2019).
[13] Dolgikh, S.: Unsupervised clustering in epidemiological factor analysis. The Open Bioinformatics Journal 14(1), 63–72 (2021).
[14] Opitz, D., Maclin, R.: Popular ensemble methods: An empirical study. Journal of Artificial Intelligence Research 11, 169–198 (1999).
[15] Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward neural networks are universal approximators. Neural Networks 2(5), 359–366 (1989).
[16] Le, Q.V.: A tutorial on deep learning: autoencoders, convolutional neural networks and recurrent neural networks. Stanford University (2015).
[17] Dolgikh, S.: Low-dimensional representations in generative self-learning models. In: Proc. 20th International Conference Information Technologies – Applications and Theory (ITAT-2020), Slovakia, CEUR-WS.org 2718, 239–245 (2020).
[18] Keras: Python deep learning library. https://keras.io/, last accessed: 2021/08/21.
[19] Dolgikh, S.: Topology of conceptual representations in unsupervised generative models. In: Proc. 26th International Conference on Information Society and University Studies (IVUS 2021), Kaunas, Lithuania, CEUR-WS.org 2915, 150–157 (2021).
[20] Ester, M., Kriegel, H-P., Sander, J., et al.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proc. Second International Conference on Knowledge Discovery and Data Mining (KDD-96), 226–231 (1996).
[21] Wendland, H.: Scattered data approximation. Cambridge University Press 9 (2005).
[22] Richards, J.A.: Supervised classification techniques. In: Remote Sensing Digital Image Analysis. Springer, Berlin, Heidelberg, 247–318 (2013).