An Inductive Framework for Semi-supervised Learning (Discussion Paper)

Shuyi Yang1,2, Dino Ienco3, Roberto Esposito1 and Ruggero G. Pensa1
1 University of Turin, Italy
2 Intesa Sanpaolo, Turin, Italy
3 INRAE, UMR TETIS, Montpellier, France

Abstract
Semi-supervised learning is crucial in many applications where accessing class labels is unaffordable or costly. The most promising approaches are graph-based, but they are transductive and do not provide a generalized model that works in inductive scenarios. To address this problem, we propose a generic framework for inductive semi-supervised learning based on three components: an ensemble of semi-supervised autoencoders providing a new data representation that leverages the knowledge supplied by the small amount of available labels; a graph-based step that augments the training set with pseudo-labeled instances; and, finally, a classifier trained with labeled and pseudo-labeled instances. The experimental results show that our framework outperforms state-of-the-art inductive semi-supervised methods.

Keywords
semi-supervised learning, graph-based algorithms, inductive methods

SEBD 2021: The 29th Italian Symposium on Advanced Database Systems, September 5-9, 2021, Pizzo Calabro (VV), Italy
shuyi.yang@unito.it (S. Yang); dino.ienco@inrae.fr (D. Ienco); roberto.esposito@unito.it (R. Esposito); ruggero.pensa@unito.it (R. G. Pensa)
ORCID: 0000-0002-8736-3132 (D. Ienco); 0000-0001-5366-292X (R. Esposito); 0000-0001-5145-3438 (R. G. Pensa)

1. Introduction

Prediction is one of the most important outcomes of any machine learning algorithm, but its accuracy strongly depends on the amount and quality of labeled instances and, unfortunately, labeling is a cost-intensive manual activity requiring time, money, and expertise. Hence, labeling often turns out to be unaffordable for many organizations and, consequently, only small amounts of labeled instances are available for training. Semi-supervised learning aims at mitigating the above-mentioned problem by leveraging the so-called smoothness and cluster assumptions: if two data instances are close to each other or belong to the same cluster in the input distribution, then they are likely to belong to the same class [1]. Graph-based models constitute one of the main families of semi-supervised techniques [2]. They leverage the manifold assumption: the graphs, typically nearest-neighbor graphs built upon the local similarity between data points, provide a lower-dimensional representation of the high-dimensional input data.

Unfortunately, graph-based methods are transductive [2], i.e., they do not construct any classification model, and the prediction is limited to exactly those data instances that are already available during the training phase. Therefore, graph-based methods are unable to classify new data examples, unless they are trained again on the augmented dataset. A second limitation concerns the construction of the graph: in general, this phase is completely unsupervised, even though labels are available for some instances.
When the cluster assumption is not completely satisfied, this could lead to poor prediction results. In this paper, we present a novel graph-based semi-supervised framework, ESA⋆, that improves in the areas mentioned above: it takes into account the information carried by labeled instances during the graph construction and is designed to work properly in inductive settings. Our approach (sketched graphically in Fig. 1) first constructs a new representation using a semi-supervised autoencoder that takes all labeled and unlabeled training data as input.

Figure 1: The proposed framework.

The representations learnt by the semi-supervised autoencoder, for both labeled and unlabeled training data, are then processed by a graph-based semi-supervised algorithm that propagates the label information from labeled to unlabeled data instances. This procedure provides pseudo-labels for the set of unlabeled instances. We propose two variants of our framework: the first, ESA_LP, is based on a graph-based label propagation algorithm [3]; the second one, ESA_GAT, exploits a graph convolutional neural network with masked self-attention layers [4]. All training instances (labeled, and unlabeled with pseudo-labels) are then used to train a classification model, which can perform prediction on new unseen examples as well. We show that our approach outperforms state-of-the-art inductive approaches and is competitive with state-of-the-art transductive methods, even with extremely small amounts of labeled instances. A more in-depth presentation of our proposal is reported in the full version of this paper [5].

2. Inductive graph-based semi-supervised learning

In a semi-supervised learning setting, in addition to labeled instances, unlabeled ones are introduced as part of the available data during the training phase: let $X_l \in \mathbb{R}^{n_l \times f}$ be the matrix of $n_l$ labeled samples, each with $f$ predictors, and $y_l$ be the corresponding labels; a supplementary matrix $X_u \in \mathbb{R}^{n_u \times f}$ representing $n_u$ unlabeled instances is also provided, without the corresponding labels $y_u$. Generally, the number $n_l$ of labeled instances is limited and much smaller than the number $n_u$ of unlabeled instances. Our framework aims to provide an inductive semi-supervised learning algorithm by leveraging graph-based semi-supervised learning in order to augment the amount of labeled instances used to train a supervised classifier.

As shown in Figure 1, our framework consists of different parts: embedding computation, pseudo-labeling of unlabeled instances, and classification. In the embedding computation part, we train an ensemble of neural networks to extract a latent representation for each instance. These representations are used to build a graph over labeled and unlabeled instances, so that a graph-based model can be employed to provide a pseudo-label for each unlabeled instance. Finally, labeled instances and pseudo-labeled ones are both used to train a supervised classification model.

In order to extract the data embeddings, an Ensemble of Semi-supervised Autoencoders (ESA) [6] is trained on both labeled and unlabeled data. The loss function we use to learn the internal parameters of each semi-supervised autoencoder (SSAE) is a combination of a reconstruction and a classification loss.
More formally,

  $L_{SSAE} = L_{AE} + \lambda L_{CL}$,   (1)

where

  $L_{AE} = \frac{1}{n_l + n_u} \sum_{x_i \in X_l \cup X_u} \| x_i - D(E(x_i \mid \theta_E) \mid \theta_D) \|^2$,   (2)

  $L_{CL} = -\frac{1}{n_l} \sum_{x_i \in X_l} \sum_{c=1}^{|C|} y_{l_{ic}} \cdot \log\big(CL(E(x_i \mid \theta_E) \mid \theta_{CL})_c\big)$,   (3)

and $\theta_E$, $\theta_D$ and $\theta_{CL}$ are, respectively, the sets of parameters of the encoder, the decoder and the classification layer, $y_{l_{ic}}$ is the $c$-th element of the $i$-th row of $y_l$, $CL(\cdot)_c$ is the $c$-th element of the output vector of CL, and $\lambda$ is a parameter that controls the importance of the classification loss.

In our architecture, the encoder has an input layer followed by two hidden layers; the decoder has one hidden layer of the same size as the first hidden layer of the encoder, and an output layer. The sizes of the input layer, the output layer and the classification layer are fixed to $f$, $f$ and $|C|$, respectively, while $size_{hidden}$ and $size_{bottleneck}$ (respectively, the size of the hidden layer and that of the bottleneck layer) can be varied. In order to get diverse, multi-resolution representations, similarly to [6], we train $K$ independent SSAEs, each with the sizes of the layers drawn randomly from the intervals $f/2 \le size_{hidden} < f$ and $f/4 \le size_{bottleneck} < f/2$. Once the ensemble is trained, we obtain the new representations $\Phi(X_l)$ and $\Phi(X_u)$ of $X_l$ and $X_u$ by concatenating the embeddings of these $K$ SSAEs:

  $\Phi(\cdot) = \big\Vert_{k=1}^{K} E_k(\cdot \mid \theta_{E_k})$,   (4)

where $\Vert$ is the stack (concatenation) operator and $E_k$ and $\theta_{E_k}$ are, respectively, the encoder of the $k$-th SSAE and its weights. Given the latent representations, a kNN graph structure can be derived from the data points of $X_l \cup X_u$: embedding representations are nodes, and two of them are considered connected if both belong to the top $k$ nearest neighbors of each other.

Figure 2: A graphical representation of the label propagation strategy of ESA_LP (a) and ESA_GAT (b).

At this point we perform graph-based pseudo-labeling (GBPL) to assign pseudo class labels to unlabeled instances. To this purpose, any graph-based semi-supervised learning (GBSSL) algorithm can be applied to infer the labels $\tilde{y}_u$ of the unlabeled portion of the data by propagating the class information from the labeled data $y_l$ over the graph constructed on the embeddings. Subsequently, a supervised classifier (SC) can be trained using the union of the labeled data $(\Phi(X_l), y_l)$ and the pseudo-labeled data $(\Phi(X_u), \tilde{y}_u)$ as training set. At prediction time, we first compute the latent representation $\Phi(X_t)$ of unseen data with the trained ESA, then we make predictions with the supervised classifier SC. It is worth pointing out that, during the entire process, the transductive GBSSL step is used only during the training phase to provide pseudo-labels for the unlabeled data (as in wrapper methods), in order to help the supervised classifier generalize better. Therefore, our approach, hereinafter referred to as ESA⋆, is inductive. In the next two sections, we present two variants adopting different strategies to perform pseudo-labeling, based on different graph-based semi-supervised learning approaches.
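Before detailing the two variants, a minimal sketch of the embedding step may help fix ideas. The following code assumes a PyTorch implementation (the paper does not specify a framework); the class and function names (SSAE, ssae_loss, esa_embedding), the ReLU activations and the default $\lambda$ are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SSAE(nn.Module):
    """One semi-supervised autoencoder: encoder, decoder and classification layer."""
    def __init__(self, f, size_hidden, size_bottleneck, n_classes):
        super().__init__()
        # encoder: input (f) -> hidden -> bottleneck
        self.encoder = nn.Sequential(
            nn.Linear(f, size_hidden), nn.ReLU(),
            nn.Linear(size_hidden, size_bottleneck), nn.ReLU(),
        )
        # decoder: bottleneck -> hidden -> reconstruction (f)
        self.decoder = nn.Sequential(
            nn.Linear(size_bottleneck, size_hidden), nn.ReLU(),
            nn.Linear(size_hidden, f),
        )
        # classification layer CL attached to the bottleneck, |C| outputs
        self.cl = nn.Linear(size_bottleneck, n_classes)

    def forward(self, x):
        z = self.encoder(x)                    # embedding E(x | theta_E)
        return z, self.decoder(z), self.cl(z)  # embedding, reconstruction, class logits

def ssae_loss(model, x_all, x_lab, y_lab, lam=1.0):
    """L_SSAE = L_AE (labeled + unlabeled) + lambda * L_CL (labeled only), cf. Eqs. (1)-(3).
    y_lab holds class indices; cross-entropy on logits is equivalent to Eq. (3) with one-hot y."""
    _, x_rec, _ = model(x_all)
    l_ae = F.mse_loss(x_rec, x_all)        # mean squared reconstruction error, as in Eq. (2)
    _, _, logits = model(x_lab)
    l_cl = F.cross_entropy(logits, y_lab)  # classification term, as in Eq. (3)
    return l_ae + lam * l_cl

def esa_embedding(models, x):
    """Phi(x): concatenation of the K trained encoders' outputs, as in Eq. (4)."""
    with torch.no_grad():
        return torch.cat([m.encoder(x) for m in models], dim=1)
```

Each of the $K$ SSAEs would be trained independently by minimizing ssae_loss with a standard optimizer; only the encoders are kept for computing $\Phi(\cdot)$.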
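The mutual kNN graph over the concatenated embeddings can be sketched as follows; the Euclidean distance and the function name mutual_knn_adjacency are assumptions, since this discussion paper does not fix the similarity measure.

```python
import numpy as np

def mutual_knn_adjacency(phi, k):
    """phi: (n, d) array of ESA embeddings for X_l ∪ X_u.
    Returns a symmetric 0/1 adjacency matrix where i and j are connected
    only if each is among the other's top-k nearest neighbours."""
    n = phi.shape[0]
    dist = np.linalg.norm(phi[:, None, :] - phi[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)           # a node is not its own neighbour
    knn = np.argsort(dist, axis=1)[:, :k]    # indices of the k nearest neighbours per node
    in_knn = np.zeros((n, n), dtype=bool)
    rows = np.repeat(np.arange(n), k)
    in_knn[rows, knn.ravel()] = True
    return (in_knn & in_knn.T).astype(float)  # keep only mutual neighbour pairs
```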
2.1. Pseudo-labeling based on confidence-aware label propagation

In this section we introduce the first variant of our framework for semi-supervised learning (see Figure 2(a)). The adopted strategy consists in instantiating the graph-based pseudo-labeling (GBPL) part with a confidence-aware label propagation algorithm that works on both homophily and heterophily networks [3]. In the following we provide the details of this strategy, which we name ESA_LP. Given the adjacency matrix $A$, ESA_LP computes the probability distribution over the classes as the solution of

  $F_{i\phi} = \frac{1}{Z_i}\Big(y_{i\phi} + \beta \sum_j A_{ij}\, s_{ji}(\phi)\Big)$,

where $F_{i\phi}$ is the probability that the $i$-th instance has label $\phi$, $Z_i$ is a normalization term, $y_{i\phi}$ is the prior belief of the $i$-th instance having label $\phi$, $\beta > 0$ represents the importance of the neighborhood's influence, $A_{ij}$ is the $(i,j)$ entry of the adjacency matrix, and $s_{ji}(\phi)$ represents how strongly node $j$ believes that node $i$ has class $\phi$. More formally, $s_{ji}(\phi) = \sum_l F_{jl} H_{l\phi}$, where $H$ is the modulation matrix. If $H_{l\phi}$ is low, then class $l$ has a low correlation with class $\phi$; on the contrary, if it is high, the two classes are strongly correlated. On homophily networks, $H$ is the identity matrix, while on heterophily networks it can be designed empirically. In our experiments we assume that the graph obtained from the embeddings of ESA is a homophily network. We can rewrite the last equation in matrix form and in an iterative way, i.e., $F^{r+1} = Z^{-1}(Y + \beta A F^{r} H)$, where $Z = I + \beta D$ and $D$ is the diagonal matrix of node degrees. Once the adjacency matrix $A$ of labeled and unlabeled instances has been obtained, as described in the previous section, we initialize $F^{0}$ as an $(n_l + n_u) \times |C|$ matrix of zeros. Then we apply the iterative formula $num_{iter}$ times to obtain $F^{num_{iter}}$, which represents the probability distributions of the instances over the classes. From $F^{num_{iter}}$ we extract only the predictions for $X_u$ and keep the original labels for $X_l$. They are then used to feed a classifier SC.

2.2. Pseudo-labeling based on graph attention networks

For the second variant, we consider a completely different approach, leveraging the convolution operation of graph attention networks (GAT) [4], which are able to capture different levels of importance of the features of neighboring nodes in the kNN graph built upon the embeddings computed by ESA. We call this strategy ESA_GAT and provide the details below (a graphical representation is given in Figure 2(b)).

Given the set of nodes, each represented by a $b$-dimensional real-valued array obtained from the ESA embedding process or as the result of a previous convolutional layer, we can compute the self-attention on nodes as $e_{ij} = a(W h_i, W h_j \mid \aleph)$, where $h_i, h_j \in \mathbb{R}^{b}$ are the embeddings of instances $i$ and $j$, $W$ is a $b' \times b$ shared linear transformation matrix, and $a: \mathbb{R}^{b'} \times \mathbb{R}^{b'} \to \mathbb{R}$ is the attentional mechanism, consisting of a feedforward layer with weights $\aleph$ and LeakyReLU activation [7]. Each $e_{ij}$ is computed only for connected nodes (masked attention), so that the graph structure is embedded into the coefficients. The attention coefficients are then normalized using the softmax function, i.e., $\alpha_{ij} = \frac{\exp e_{ij}}{\sum_{\iota \in \mathcal{N}(i)} \exp e_{i\iota}}$, with $\mathcal{N}(i)$ representing the nodes connected to node $i$. The new representation of node $i$ through the attention layer is then computed as $h'_i = \sigma\big(\sum_{j \in \mathcal{N}(i)} \alpha_{ij} W h_j\big)$, with $\sigma$ a non-linear transformation. As in ESA, we can concatenate the outputs of different independent attention layers in order to employ a multi-head attention mechanism. To do that, once the ESA embeddings are obtained, we can apply the last formula multiple times: from the point of view of the neural network structure, the mechanism is realized by adding further convolutional layers, each with its own weights to be trained; the number of nodes of the last layer must be equal to the number of classes.
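To make the pseudo-labeling step of Section 2.1 concrete, here is a minimal NumPy sketch of the iterative update $F^{r+1} = Z^{-1}(Y + \beta A F^{r} H)$ under the homophily assumption ($H = I$); the function name and the default values of $\beta$ and $num_{iter}$ are illustrative assumptions.

```python
import numpy as np

def propagate_labels(A, Y, beta=0.1, num_iter=30):
    """A: (n, n) mutual kNN adjacency matrix over X_l ∪ X_u.
    Y: (n, |C|) prior beliefs (one-hot rows for labeled instances, zero rows otherwise).
    Returns F, the per-instance class distributions after num_iter iterations."""
    n = A.shape[0]
    D = np.diag(A.sum(axis=1))               # diagonal matrix of node degrees
    Z_inv = np.linalg.inv(np.eye(n) + beta * D)
    F = np.zeros_like(Y, dtype=float)        # F^0 initialised to zeros
    for _ in range(num_iter):
        F = Z_inv @ (Y + beta * A @ F)       # F^{r+1} = Z^{-1}(Y + beta * A * F^r * H), H = I
    return F

# Pseudo-labels for the unlabeled block (rows n_l .. n-1):
# y_tilde_u = propagate_labels(A, Y)[n_l:].argmax(axis=1)
```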
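Similarly, a minimal single-head, dense-matrix sketch of the masked self-attention layer behind ESA_GAT (Section 2.2) is given below, again assuming PyTorch. This is not the authors' implementation: practical GAT code uses sparse operations and multiple heads, and the sketch assumes every node has at least one neighbor (otherwise the softmax row is undefined).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    def __init__(self, b_in, b_out):
        super().__init__()
        self.W = nn.Linear(b_in, b_out, bias=False)   # shared linear transformation W
        self.a = nn.Linear(2 * b_out, 1, bias=False)  # attentional mechanism a(.) with weights aleph

    def forward(self, h, adj):
        """h: (n, b_in) node features; adj: (n, n) 0/1 adjacency matrix."""
        wh = self.W(h)                                 # (n, b_out)
        n = wh.size(0)
        # e_ij = LeakyReLU(a([W h_i || W h_j])) for every pair, then masked by the graph
        pairs = torch.cat([wh.unsqueeze(1).expand(n, n, -1),
                           wh.unsqueeze(0).expand(n, n, -1)], dim=-1)
        e = F.leaky_relu(self.a(pairs).squeeze(-1))    # (n, n)
        e = e.masked_fill(adj == 0, float('-inf'))     # masked attention: only connected nodes
        alpha = torch.softmax(e, dim=1)                # normalised attention coefficients
        return F.elu(alpha @ wh)                       # h'_i = sigma(sum_j alpha_ij * W h_j)
```

Stacking such layers, with the last one sized to the number of classes, yields the class scores used as pseudo-labels for the unlabeled nodes.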
3. Experiments

We assess the behavior of our framework under different settings, using Random Forests (RF) as the final classifier. Thirteen publicly available real-world classification datasets, encompassing a wide variety of application scenarios, have been considered in our experiments. They exhibit different sizes (from 178 to 70,000 instances) and dimensionality (from 12 to 1,087 features). Nine datasets (ANTIVIRUS, LANDSAT, MADELON, MALWARE, PARKINSON, SONAR, SPAMBASE, WAVEFORM, WINE) come from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml); four are well-known image datasets: USPS [8], MNIST (http://yann.lecun.com/exdb/mnist/), FMNIST (https://github.com/zalandoresearch/fashion-mnist) and COIL20 (https://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php).

Each dataset is randomly split into three parts: labeled instances ($p$% of the dataset), unlabeled instances ($(70 - p)$% of the dataset), and test instances (30% of the dataset). The random split is stratified, so that each part maintains the same proportion of labels as the original dataset. Every supervised model is trained on labeled instances only and evaluated on the test instances, while all semi-supervised models are trained on both labeled and unlabeled instances and evaluated on the test instances. During the experiments we vary the percentage $p$ of labeled instances to study how the performance of both supervised and semi-supervised models changes when the portion of labeled instances increases. To obtain more robust performance indicators, for each combination of dataset, percentage $p$ and model, we evaluate 25 different random splits as described above and then average the performances. The model is re-trained for each of the 25 splits and new predictions are made on the corresponding test set. As performance index, we consider the micro-averaged F1-score computed on the test set (a minimal sketch of this evaluation protocol is given at the end of Section 3.1).

In all experiments, we compute detailed performance results for each dataset, but here we report a summary of the results (Figure 3). The latter is obtained as follows: for a given percentage $p$ of labeled instances, we compute the average rank of each algorithm across all datasets, according to the micro-averaged F1-score, and then we plot the average ranks for increasing values of $p$. This allows us to obtain an overall picture of the relative performances of all competitors considered in our study. In the following, we present and discuss the results for each experiment.

3.1. Comparative analysis w.r.t. inductive methods

In the first experiment, we compare two configurations of our framework to four well-known supervised methods (Random Forests, Multilayer Perceptron, SVM and DRM [9]) and two recent state-of-the-art semi-supervised approaches: interpolation consistency training (ICT) [10] and ladder networks (LN) [11]. From the obtained results (Figure 3(a)), it emerges that, for every percentage of labeled examples, on average, the two variants of our framework outperform all other methods, including the four fully supervised classifiers considered in this study. It is worth noting that, in contrast, the two competing semi-supervised methods (ICT and LN) are not able to outperform the supervised competitors with the same consistency. The micro-averaged F1-score of ICT is below that of RF for any given value of $p$. Ladder networks (LN) are ranked third with less than 2% of labeled samples, but RF remains competitive w.r.t. LN despite not taking advantage of unlabeled instances. Finally, it is worth pointing out that, not surprisingly, when the number of labeled instances increases, the differences between semi-supervised methods and fully supervised ones decrease.

Figure 3: Average rank (lower is better) according to the micro-averaged F1-measure, as a function of the percentage of labeled instances (%): (a) ESA⋆ vs. inductive methods; (b) ESA⋆ vs. transductive methods.
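For reference, the following sketch outlines the data-splitting and rank-aggregation protocol described above, assuming scikit-learn and SciPy; the function names and argument defaults are illustrative, not the code used for the reported experiments.

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.model_selection import train_test_split

def stratified_split(X, y, p, seed):
    """30% test, p% labeled, (70 - p)% unlabeled, all splits stratified by class."""
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=0.30, stratify=y, random_state=seed)
    # p% of the whole dataset corresponds to a fraction p/70 of the remaining 70%
    X_lab, X_unlab, y_lab, _ = train_test_split(
        X_rest, y_rest, train_size=p / 70.0, stratify=y_rest, random_state=seed)
    return X_lab, y_lab, X_unlab, X_test, y_test

def average_ranks(scores):
    """scores: (n_datasets, n_methods) micro-F1 values; rank 1 = best, averaged over datasets."""
    ranks = np.vstack([rankdata(-row) for row in scores])
    return ranks.mean(axis=0)
```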
3.2. Comparative analysis w.r.t. transductive methods

We compare two configurations of our framework to two state-of-the-art transductive semi-supervised approaches: structured graph learning with multiple kernels (SGMK) [12] and robust graph construction (RGC) [13]. It is worth pointing out that, while the transductive setting has potential liabilities in terms of applicability, it has the advantage of being able to leverage more information than its inductive counterparts (since it can exploit the test set distribution when propagating the labels). As we shall see, while our method is at a disadvantage here, it nonetheless works quite well. We compare the variants ESA_LP+RF and ESA_GAT+RF to SGMK [12] and RGC [13]. Due to computational limitations, the experiments are performed only on a subset of the available datasets, namely ANTIVIRUS, SONAR, PARKINSON and WINE. From the results shown in Figure 3(b), we can conclude that, within our experimental settings, ESA_LP+RF and ESA_GAT+RF reach performances competitive with state-of-the-art transductive methods (i.e., RGC) and, in some cases, even outperform them (i.e., SGMK).

4. Conclusion

In this paper, we have presented a new inductive semi-supervised learning framework that combines the strengths of two successful approaches: semi-supervised autoencoders and graph-based semi-supervised learning. While the former supports the generation of new data representations improved by labeled instances, the latter spreads the label information to unlabeled instances in the new representation space. Through an extensive experimental study, we have shown that a classifier trained with both labeled and pseudo-labeled instances achieves better prediction accuracy than its supervised counterpart trained only on labeled instances, and also outperforms state-of-the-art semi-supervised competitors.

References

[1] O. Chapelle, B. Schölkopf, A. Zien (Eds.), Semi-Supervised Learning, The MIT Press, 2006.
[2] J. E. van Engelen, H. H. Hoos, A survey on semi-supervised learning, Mach. Learn. 109 (2020) 373–440.
[3] Y. Yamaguchi, C. Faloutsos, H. Kitagawa, CAMLP: confidence-aware modulated label propagation, in: S. C. Venkatasubramanian, W. M. Jr. (Eds.), Proceedings of the International Conference on Data Mining, SIAM 2016, Miami, Florida, USA, May 5-7, 2016, SIAM, 2016, pp. 513–521.
[4] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Liò, Y. Bengio, Graph attention networks, in: Proceedings of the 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, OpenReview.net, 2018.
[5] S. Yang, D. Ienco, R. Esposito, R. G. Pensa, ESA⋆: A generic framework for semi-supervised inductive learning, Neurocomputing 447 (2021) 102–117.
[6] D. Ienco, R. G. Pensa, Enhancing graph-based semisupervised learning via knowledge-aware data embedding, IEEE Trans. Neural Networks Learn. Syst. 31 (2020) 5014–5020.
[7] A. L. Maas, A. Y. Hannun, A. Y. Ng, Rectifier nonlinearities improve neural network acoustic models, in: Proceedings of the International Conference on Machine Learning, ICML 2013, Atlanta, USA, June 16-21, 2013.
[8] J. J. Hull, A database for handwritten text recognition research, IEEE Transactions on Pattern Analysis and Machine Intelligence 16 (1994) 550–554.
[9] C. Peng, Q. Cheng, Discriminative ridge machine: A classifier for high-dimensional data or imbalanced data, IEEE Transactions on Neural Networks and Learning Systems (2020). Available online.
[10] V. Verma, A. Lamb, J. Kannala, Y. Bengio, D. Lopez-Paz, Interpolation consistency training for semi-supervised learning, in: S. Kraus (Ed.), Proceedings of the 28th International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019, ijcai.org, 2019, pp. 3635–3641.
[11] A. Rasmus, M. Berglund, M. Honkala, H. Valpola, T. Raiko, Semi-supervised learning with ladder networks, in: C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, R. Garnett (Eds.), Proceedings of the Annual Conference on Neural Information Processing Systems 2015, NIPS 2015, Montreal, Quebec, Canada, December 7-12, 2015, 2015, pp. 3546–3554.
[12] Z. Kang, C. Peng, Q. Cheng, X. Liu, X. Peng, Z. Xu, L. Tian, Structured graph learning for clustering and semi-supervised classification, Pattern Recognition 110 (2021) 107627.
[13] Z. Kang, H. Pan, S. C. Hoi, Z. Xu, Robust graph learning from noisy data, IEEE Transactions on Cybernetics 50 (2019) 1833–1843.