<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Meta-Active Learning approach exploiting Instance Importance based on Learning Gradient Variation</article-title>
        <subtitle>(Discussion Paper)</subtitle>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sergio Flesca</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Domenico Mandaglio</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco Scala</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Tagarelli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DIMES Dept., University of Calabria</institution>
          ,
          <addr-line>87036 Rende (CS)</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>A major challenge in active learning is to select the most informative instances to be labeled by an annotation oracle at each step. In this respect, one effective paradigm is to learn the active learning strategy that best suits the performance of a meta-learning model. This strategy first measures the quality of the instances selected in the previous steps and then trains a machine learning model that is used to predict the quality of instances to be labeled in the current step. In this paper, we discuss a new learning-to-active-learn approach that selects the instances to be labeled as the ones producing the maximum change to the current classifier. The key idea is to select such instances according to their importance, reflecting variations in the learning gradient of the classification model. Our approach can be instantiated with any classifier trainable via gradient descent optimization, and here we provide a formulation based on a deep neural network model, which has not been deeply investigated in existing learning-to-active-learn approaches. The experimental validation of our approach has shown promising results in scenarios characterized by relatively few initially labeled instances.</p>
      </abstract>
      <kwd-group>
        <kwd>meta-learning models</kwd>
        <kwd>model-change framework</kwd>
        <kwd>learning to active learn</kwd>
        <kwd>active learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Supervised machine learning methods typically require a large number of training data instances.
However, manually labeling training instances is a costly and time consuming process, especially
for specialized domains, where a deep expertise is required for correctly labeling data instances.
Active Learning aims at selecting the data instances to be labeled by an expert, or annotation
oracle, in order to train a machine learning model as quickly and effectively as possible. Several
strategies have been proposed in the literature [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which select the instances to be provided
to the oracle for annotation using different heuristics; however, none of these heuristics has
shown to outperform the others in every scenario of interest. To overcome these limitations,
meta-active learning approaches have been proposed to automatically detect the best strategy
of selection of the instances to be annotated [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4</xref>
        ].
      </p>
      <p>
        In this paper, we discuss the main contributions from our earlier study [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], where we
introduced a new meta-active learning method whose instance selection step, modeled as a regression
problem, exploits the training gradient of a deep neural network model, and in general of any
machine learning model whose training is based on a gradient descent method. Experiments
conducted on CIFAR-10 image data, and including a comparison with some baselines, have
shown promising results by the proposed approach in terms of percentage increase in accuracy.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Active learning methods typically fall into one of the following categories: Uncertainty Sampling,
Query-By-Committee, Expected Model Change, Expected Error Reduction, Variance Reduction,
and Density-Weighted [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        Uncertainty sampling aims to improve the quality of the labeled dataset by selecting as
instances to be labeled those such that the trained classifier is most uncertain in assigning a
class label [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ]. Among this class of methods, the most popular one is probably least confidence
sampling (LCS) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] which uses as uncertainty measure for an instance the difference between
100% confidence and the confidence of the most confidently predicted label for the instance. Other approaches use
different multi-class uncertainty sampling variants, such as margin sampling [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] or entropy [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        The query-by-committee approach [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] maintains a set of prediction models, or committee,
that are used to predict the label of an instance. The instance over which there is the maximum
disagreement on the labels predicted by the models in the committee is regarded as the most
informative and hence selected for labeling. Several specializations of the approach have been
proposed using diferent models for the committee members [
        <xref ref-type="bibr" rid="ref11 ref12 ref13 ref14">11, 12, 13, 14</xref>
        ].
      </p>
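      <p>As an illustration, committee disagreement is often quantified via vote entropy (one common choice; the specializations cited above use a variety of measures). A minimal sketch in Python, where committee members are modeled as plain prediction functions:</p>

```python
import math
from collections import Counter

def vote_entropy(votes):
    """Entropy of the committee's label votes for one instance:
    0 when all members agree, maximal under full disagreement."""
    n = len(votes)
    return -sum((c / n) * math.log(c / n) for c in Counter(votes).values())

def most_informative(instances, committee):
    """Pick the instance with the greatest committee disagreement."""
    return max(instances, key=lambda x: vote_entropy([m(x) for m in committee]))
```

      <p>For example, an instance on which two members predict opposite labels has vote entropy log 2, and would be preferred over an instance on which they agree.</p>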
      <p>
        The expected model change framework [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] aims to define a strategy for selecting the instance
that would yield the greatest change to the current model if we knew its label. The strategy
computes the expected gradient length and uses it as a measure of the expected change to
the model that is associated to the labeling of an instance. The key idea is to prefer instances
that are likely to have the greatest influence in changing the model. Theoretical aspects of
this framework have been well studied for support vector machines and linear regression [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ],
although it can be computationally expensive for large feature spaces and label sets.
      </p>
      <p>
        Expected error reduction aims to select the instance x that yields the maximum reduction of
the model generalization error once the model is trained using the label of x too. However, since the
labels of some instances are not known, the error reduction is usually approximated using the expectation
over all possible labels under the current model. This framework has been successfully used
with a variety of models such as Naıve Bayes [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], logistic regression [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], and SVM [19].
      </p>
      <p>Variance reduction methods reduce the generalization error indirectly by minimizing output
variance. The early method in [20] was proposed for active learning based on the reduction
of the estimated distribution of the model’s output for regression. Applications of variance
reduction include multi-class image classification [21].</p>
      <p>
        The key idea of density-weighted methods is that informative instances should not only be
the uncertain ones, but also those representative of the underlying distribution [
        <xref ref-type="bibr" rid="ref11 ref8">11, 22, 23, 8, 24</xref>
        ].
Hence, the instances are selected according to both a base selection measure (e.g., LCS) and a
density based measure (e.g., the average similarity of an instance w.r.t. the other instances).
      </p>
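      <p>The combination of a base selection measure with a density measure can be sketched as follows; the product form and the β weighting mirror common information-density formulations, while the similarity function is left abstract (both are illustrative assumptions, not a specific method from the literature above):</p>

```python
def density_weighted_score(x, unlabeled, base_score, sim, beta=1.0):
    """Weight a base informativeness measure (e.g., LCS uncertainty) by the
    average similarity of x to the other unlabeled instances."""
    others = [u for u in unlabeled if u != x]
    density = sum(sim(x, u) for u in others) / len(others)
    return base_score(x) * density ** beta
```

      <p>With β = 0 the density term vanishes and the score reduces to the base measure alone.</p>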
      <p>
        Meta-learning algorithms have recently been proposed for the active learning tasks. In [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ],
several active learning heuristics are combined using a bandit algorithm exploiting a maximum
entropy criterion that estimates classification performance without knowing the actual labels.
Rather than combining existing heuristics, the meta-learning approach to active learning in [25]
models the active learning task as a regression problem: given a trained classifier and its output
for a specific unlabeled instance, it predicts the reduction in generalization error that can be
expected by providing the actual label of the instance. Note that the regressor in [25] is required
to be trained on a specific set of instance-driven features, such as the variance of the classifier
output for the instance or the predicted probability distribution over possible labels for the
instance. Our approach does not have the same constraint, since we utilize the raw features
of the instances, yet we can in principle exploit instance-driven features. More importantly,
for each active learning epoch, [25] requires to perform several training steps of the classifier
while we perform just a single training step.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Approach</title>
      <p>A classification problem consists in associating every instance x taken from a predefined domain
X with a label selected from a fixed domain of labels ℒ. We assume the presence of a set of
instance-label pairs LS ⊆ X × ℒ and a set of unlabeled instances US ⊆ X, where for each pair
⟨x, y⟩ ∈ LS, x is an instance in X and y is the label associated with x.</p>
      <p>Algorithm 1 shows the general schema of the proposed approach, named Learning to Active
Learn by Instance Importance based Gradient Variation (LAL-IGradV). LAL-IGradV receives
in input a (small) set of labeled instances LS, a set of unlabeled instances US, a deep neural
network model DNN, a regressor model R, the number h of active learning epochs, and the
number NLI of unlabeled instances to select for oracle labeling at each active learning epoch.</p>
      <p>Our proposed approach is comprised of two phases: an initialization phase and an iterative phase. In
the initialization phase, the algorithm first trains DNN using LS (line 1), randomly selects NLI
unlabeled instances from US and asks the oracle to label them, thus obtaining the initial set
LS′ of oracle-labeled instances (lines 2-3). In each step of the iterative phase (lines 4-11), the
set LS′ of newly labeled instances is used to train the classifier together with the set LS (line 5).
When retraining the classifier, every instance x ∈ LS′ is associated with its importance score i_x.
The computation of the importance scores of the instances in LS′ is performed using one of the
techniques described in Section 3.1. Next, a regressor R is trained on the set {(x, i_x) | x ∈ LS′}
and the NLI instances are added to LS (lines 6-7). The regressor R is then applied to the instances
in US so that, given an instance x, it predicts its importance score î_x (line 8). Finally, the top-NLI
instances having the greatest importance score are selected for oracle labeling and, once labeled,
they replace the set LS′ so as to start the next active learning step (lines 9-11).</p>
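      <p>The schema above can be sketched in Python as follows; the DNN training routine, the importance-scoring function, the oracle, and the regressor interface are all stubs standing in for the components described in this section, not the actual implementation:</p>

```python
import random

def lal_igradv(LS, US, train_dnn, importance_scores, regressor, oracle, h, nli):
    """Sketch of the LAL-IGradV loop. LS: list of (instance, label) pairs;
    US: list of unlabeled instances; oracle(x) returns the true label;
    regressor exposes fit/predict. All components are placeholders."""
    train_dnn(LS)                                  # line 1: initial training on LS
    picked = random.sample(US, nli)                # line 2: random initial selection
    US = [x for x in US if x not in picked]
    LS1 = [(x, oracle(x)) for x in picked]         # line 3: oracle annotates them
    for _ in range(h):                             # line 4: active learning epochs
        scores = importance_scores(LS + LS1)       # line 5: retrain DNN, score instances
        regressor.fit([x for x, _ in LS1],         # line 6: fit regressor on (x, i_x)
                      [scores[x] for x, _ in LS1])
        LS = LS + LS1                              # line 7: newly labeled join LS
        pred = {x: regressor.predict(x) for x in US}    # line 8: predict scores on US
        top = sorted(US, key=pred.get, reverse=True)[:nli]   # line 9: top-NLI
        US = [x for x in US if x not in top]       # line 10: remove them from US
        LS1 = [(x, oracle(x)) for x in top]        # line 11: oracle annotates
    return LS + LS1
```

      <p>Note that the regressor is fit on the raw instances, matching the fact that LAL-IGradV does not require hand-crafted instance-driven features.</p>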
      <p>
        Following the model change framework [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], the importance score of an instance x measures
the impact of having x in the training set on the obtained classifier. That is, the importance
score of a (labeled) instance x w.r.t. a set of labeled instances LS is a measure of the difference
between the parameters of the classifier M trained over LS and the parameters of the classifier M̂
trained over LS ∪ {⟨x, y⟩}, where y is the label of x. Unfortunately, in the case of neural network
classifiers, for the most commonly used training methods, such as stochastic gradient descent, such a
difference between the parameters of the model (almost) does not exist. To overcome this issue,
we define different notions of importance score, as discussed next.
      </p>
      <p>Algorithm 1: LAL-IGradV</p>
      <p>Data: LS: set of labeled instances, US: set of unlabeled instances, DNN: deep neural network model, R: importance score regressor, h: maximum number of epochs, NLI: number of relevant instances to select
1 Train DNN on LS
2 LS′ ← Select NLI instances from US uniformly at random
3 The oracle annotates the instances in LS′
4 for t = 1 . . . h do
5     Train DNN on LS ∪ LS′ and compute importance score i_x for each x ∈ LS ∪ LS′
6     Train R on the set of pairs {⟨x, i_x⟩ | x ∈ LS′}
7     LS ← LS ∪ LS′
8     Apply R to US instances to predict importance scores î_x
9     LS′ ← Select top-NLI instances from US by importance score î
10    US ← US ∖ LS′
11    The oracle annotates the instances in LS′</p>
      <p>3.1. Importance scoring strategies</p>
      <p>Let F(x, θ) be the output of a DNN model F characterized by a vector of parameters θ for an
input x, and let T = {x_1, . . . , x_n} be a set of instances used for training F, where each sample
x ∈ T is associated with a label y_x. The training of the DNN F over T requires solving

arg min_θ ( ∑_{x ∈ T} ℓ(y_x, F(x, θ)) + Ω(θ) ),

where ℓ(y_x, F(x, θ)) is the loss of the model for instance x and Ω(θ) is the regularization of
the parameters. The training of F is done by iteratively updating the parameters θ through two
steps: (i) computing the change in the loss w.r.t. all parameters, i.e., the gradient, defined as

∇(T) = ∇_θ ( ∑_{x ∈ T} ℓ(y_x, F(x, θ)) + Ω(θ) ),

and (ii) updating θ using ∇(T), i.e., θ_{t+1} = θ_t − η × ∇(T), where η is the update step size.</p>
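      <p>The two-step update can be illustrated on a toy one-parameter least-squares model (the data and loss below are purely illustrative):</p>

```python
def train_step(theta, data, eta=0.05):
    """One gradient-descent iteration for the toy model f(x) = theta * x
    with squared loss: (i) compute the gradient, (ii) update theta."""
    grad = sum(-2 * x * (y - theta * x) for x, y in data)  # step (i)
    return theta - eta * grad                              # step (ii)

data = [(1.0, 2.0), (2.0, 4.0)]  # illustrative samples drawn from y = 2x
theta = 0.0
for _ in range(50):
    theta = train_step(theta, data)
# theta converges toward 2.0
```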
      <p>We define four strategies to associate each instance in LS′ with its importance score during
the training of the DNN classifier. The goal shared by the various techniques is to modify
the training of the neural network model by accounting for the importance of the instances in
LS′ involved in each training step. Each of the proposed techniques makes use of the gradient
corresponding to the instances currently in LS and LS′, i.e., ∇(LS ∪ LS′), hereinafter simply
denoted as ∇. The four proposed techniques differ in the way the importance of an instance
x in LS′ is calculated with respect to the single epoch. We will use the symbol ∇_x to denote the
value of the gradient ∇({x}), and ∇_¬x to denote the value of the gradient ∇(LS ∪ LS′ ∖ {x}).
In the following, we describe our proposed techniques for computing the importance scores.</p>
      <p>Direct similarity (DS) – given an instance x in LS′, this strategy compares the learning
gradient of the neural network at the current epoch, ∇, with the gradient calculated with respect
to x only, i.e., ∇_x. The importance score of x at the current epoch is defined as the cosine
similarity between ∇ and ∇_x, i.e., i_x = sim(∇, ∇_x). The rationale of this strategy is that an
instance x ∈ LS′ is likely to be more important for the training of the DNN at the current epoch
if there is a small difference between the directions of the gradients ∇ and ∇_x, as reflected by a
high value of the cosine similarity between the two gradients. That is, the more the learning
behavior of the neural network considering the whole training set is similar to the one of the
same neural network trained on x only, the higher the importance of x is.</p>
      <p>Ranked direct similarity (RDS) – this strategy first applies the DS technique, then the
importance scores of the instances in LS′ computed by DS are ordered and divided into three
bins, which correspond to the top quartile of the importance scores, the bottom quartile, and
the union of the second and third quartiles. The instances falling into the top quartile will be
associated with score 1, the ones falling into the bottom quartile with score 0, and the other
instances with score 0.5.</p>
      <p>Leave-one-out distance (LD) – given an instance x in LS′, this strategy compares ∇ with
the gradient calculated when leaving out x, i.e., ∇_¬x. The importance score of x at the current
epoch is defined as the complement of the cosine similarity (i.e., the cosine distance) between ∇
and ∇_¬x, i.e., i_x = 1 − sim(∇, ∇_¬x). The rationale of this strategy is that an instance x ∈ LS′ is
likely to be more important for the training of the DNN at the current epoch if leaving it out
leads to a large difference between the learning behavior of the neural network considering the
whole training set and the learning behavior of the same neural network trained without x, i.e.,
a large change in the direction of the gradient ∇_¬x w.r.t. the gradient ∇, as reflected by a high
value of the cosine distance between the two gradients.</p>
      <p>Ranked leave-one-out distance (RLD) – analogously to RDS w.r.t. DS, the RLD strategy
adds the same discretization step over the importance scores computed by LD.</p>
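      <p>For illustration, the four scoring strategies can be sketched over plain gradient vectors; the cosine similarity and the quartile binning follow the definitions above, while the simple index-based quartile computation is our own choice of discretization:</p>

```python
import math

def cos_sim(g1, g2):
    """Cosine similarity between two gradient vectors."""
    dot = sum(a * b for a, b in zip(g1, g2))
    n1 = math.sqrt(sum(a * a for a in g1))
    n2 = math.sqrt(sum(b * b for b in g2))
    return dot / (n1 * n2)

def ds_score(grad_full, grad_x):
    """Direct similarity: i_x = sim(grad, grad_x)."""
    return cos_sim(grad_full, grad_x)

def ld_score(grad_full, grad_not_x):
    """Leave-one-out distance: i_x = 1 - sim(grad, grad_not_x)."""
    return 1.0 - cos_sim(grad_full, grad_not_x)

def rank_scores(scores):
    """RDS/RLD-style discretization: top quartile to 1, bottom quartile to 0,
    the two middle quartiles to 0.5."""
    ordered = sorted(scores)
    q1 = ordered[len(ordered) // 4]
    q3 = ordered[(3 * len(ordered)) // 4]
    return [1.0 if s >= q3 else (0.0 if q1 >= s else 0.5) for s in scores]
```

      <p>A gradient aligned with the full-batch gradient gets a DS score near 1, while an instance whose removal barely changes the gradient direction gets an LD score near 0.</p>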
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Evaluation</title>
      <p>Data. We used the well-known CIFAR-10 dataset [26], which consists of 60000 instances
representing 32x32 colour images, labeled using 10 mutually exclusive classes, with 6000 images
per class. The dataset is organized into 50000 instances as the training set and 10000 instances
as the test set. The latter contains exactly 1000 randomly-selected images from each class, while
the training set is comprised of five training batches, which together contain 5000 images from each
class. We divided the training set into two parts, the one corresponding to the set of labeled
instances (LS), and the other corresponding to the set of unlabeled instances (US).</p>
      <p>
        Baseline methods. We compare the performance of our methods with a Random baseline
and the LCS method [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The Random baseline, hereinafter denoted as Rnd, simply selects NLI
instances to be annotated at each epoch uniformly at random from the set of unlabeled instances.
The LCS method follows an uncertainty sampling approach, therefore the unlabeled instance
selection is driven by the uncertainty of the instances. More precisely, given an instance x
and a classification model M, the LCS method measures the uncertainty of x w.r.t. M as
u(x) = (1 − P_M(y* | x)) × n/(n − 1), where P_M(y* | x) denotes the probability that the model M assigns
to the label y* for the instance x, y* is the label for which M yields the maximum probability
on x (i.e., y* = arg max_y P_M(y | x)), and n is the cardinality of the set of labels. Note that the
uncertainty function ranges within [0, 1], where 1 is the most uncertain score.
      </p>
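      <p>The normalized least-confidence measure can be transcribed directly; here probs is assumed to be the model's predicted probability distribution over the n labels for an instance:</p>

```python
def lcs_uncertainty(probs):
    """Normalized least-confidence: u(x) = (1 - P(y*|x)) * n/(n - 1).
    probs: predicted probability distribution over the n labels.
    Ranges in [0, 1]; 1 is the most uncertain score."""
    n = len(probs)
    return (1.0 - max(probs)) * n / (n - 1)
```

      <p>A uniform distribution yields the maximum score of 1, while a fully confident prediction yields 0.</p>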
      <p>Settings and assessment criteria. In our experimental evaluation, we used 6 Convolutional
Neural Network (CNN) 2D layers, with 3 input channels, kernel size 3, stride size 3, padding
size 1, ReLU activation function. The CNN module has on top a fully-connected network with
an input layer of size 4096, one hidden layer with input size 4096 and output size 1024, another
hidden layer with input size 1024 and output size 512, an output layer of size 10 (i.e., number of
classes), and a dropout layer with probability 0.1.</p>
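      <p>As a sanity check on the stated layer configuration, the standard output-size formula for a 2D convolution can be computed as follows (how the six convolutional layers compose into the 4096-unit input is not detailed above, so only the per-layer formula is shown):</p>

```python
def conv2d_out(size, kernel, stride, padding):
    """Output spatial size of one 2D-convolution dimension:
    floor((size + 2*padding - kernel) / stride) + 1."""
    return (size + 2 * padding - kernel) // stride + 1
```

      <p>For instance, with kernel size 3, stride 3, and padding 1, a 32x32 input yields an 11x11 feature map.</p>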
      <p>In our LAL-IGradV algorithm, the DNN model was trained using cross entropy as loss function
and Adam optimizer (with learning rate 1e-4 and weight decay 5e-4), a number of epochs equal
to 10 for both the initialization step of training (Line 1) and the training steps in the main loop
(Line 5). Also, the maximum number of iterations of the algorithm, i.e., number of epochs in
the active learning process (h) was set to 10. Unless otherwise specified, the number NLI of
instances to select from US was set to 500; the size of LS, resp. US, was experimentally varied.
As the regressor (R), we used two models: the Gradient Boosting Regressor, with least absolute
deviations (LAD) loss function and 200 estimators, for the DS and LD strategies, and the Random
Forest Classifier, with maximum depth 5, for the RDS and RLD strategies.</p>
      <p>To simulate the oracle for annotating the instances, we resorted to the availability of class
label information for the CIFAR-10 data: whenever an instance was used in the US set, we
masked its actual label during the learning process, and we unveiled the label only if the instance
was selected within the LS′ set of instances to annotate.</p>
      <p>To assess the performance of the methods, we considered the accuracy of the classifier
during the various training batches, in absolute terms as well as in terms of percentage increase
w.r.t. the early accuracy of the classifier itself or the accuracy of a reference method. More
precisely, we computed: the accuracy at the initial step of training of LAL-IGradV (line 1),
denoted as acc(0), and the accuracy at the end of the active learning process, denoted as acc; the
percentage increase in the accuracy of LAL-IGradV, which is defined as 100(acc − acc(0))/acc(0);
the percentage increase in the accuracy of LAL-IGradV w.r.t. Rnd, resp. LCS, which is defined
as Δ%Rnd = 100(acc − acc_Rnd)/acc_Rnd, resp. Δ%LCS = 100(acc − acc_LCS)/acc_LCS, where acc_Rnd and acc_LCS
denote the accuracy at the end of the active learning process for Rnd and LCS.</p>
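      <p>These assessment criteria amount to simple relative differences, e.g.:</p>

```python
def pct_increase(acc, acc_ref):
    """Percentage increase of an accuracy value w.r.t. a reference accuracy."""
    return 100.0 * (acc - acc_ref) / acc_ref
```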
      <p>Results. Table 1 reports on the performance of our LAL-IGradV variants corresponding
to the four importance scoring techniques, for varying percentages of the set of unlabeled
instances (US). As expected, the accuracy values (i.e., the columns corresponding to acc and acc(0))
tend to decrease as the percentage of unlabeled instances gets higher, since the LAL-IGradV
method is forced to handle progressively reduced sets of labeled instances in its initial training.
More interestingly, the percentage increase of each of the LAL-IGradV variants w.r.t. both Rnd
and LCS is always positive (up to 6.5% against Rnd and up to 3.2% against LCS), and it tends
to improve with higher percentages of unlabeled instances, with peaks around 70% against Rnd
and around 50-60% against LCS. As concerns the impact of the importance scoring technique,
we observe that all the LAL-IGradV variants are able to improve upon the accuracy at the initial
training step. Moreover, the direct similarity based techniques, i.e., DS and RDS, prove to be
more efficient as well as more accurate than the leave-one-out distance based techniques, for
each percentage of the unlabeled set. We tend to ascribe this to a higher sensitivity of the
approach in capturing the gradient direction change due to the individual contribution of an
instance rather than to the masking of a single instance in the training gradient, which would
result in a more diluted signal of variation of the training gradient.</p>
      <p>We analyzed the percentage increase in accuracy that each active learning method achieves
by varying the fraction of unlabeled instances. As expected, given the advantage of performing
an active learning task, the percentage increase values (results not shown) tend to improve for
higher fractions of unlabeled instances. The trends are steeper for our LAL-IGradV methods
(around 10% increase), particularly for DS and RDS, followed by LCS. Indeed, it is worth
emphasizing that our LAL-IGradV methods achieve the best performance gain against the two
baselines as the fraction of labeled instances becomes smaller.</p>
      <p>In Fig. 1, we delve into the trends of accuracy percentage-increase obtained by each
active learning method, for varying NLI, i.e., the number of unlabeled instances to be selected at each
epoch of the active learning process. At first glance, in each of the plots, we notice that the
curve of the percentage increase values over NLI is more likely to change for larger fractions of
the set of unlabeled instances, with the most evident changes corresponding to 90%.</p>
      <p>A few interesting remarks can be drawn from Figs. 1(a)-(d). When portions of US below 90%
are selected, we observe a relatively small range of variation of the percentage increase values
(approximately from 5% to 10%), with peaks around NLI = 500 for the DS and LD variants, and
around NLI = 900 for the RDS and RLD variants. This would hint at higher requirements (i.e.,
higher NLI) for the importance scoring strategies that compute discretized importance
scores. Another remark is on the curves corresponding to the use of 90% of the set of unlabeled
instances: compared to the cases with lower fractions of US, the percentage increase values
are higher on average, and the trends are quite different, especially for the DS variant where
we observe a minimum (rather than a maximum) for NLI = 500. Apart from this exception, it
is worth noticing that a better percentage increase of accuracy does not necessarily correspond
to a higher number NLI of selected instances. This might be explained by the fact that the more unlabeled
instances are selected for labeling, the less likely the method is to make a correct choice
for changing the current model the most, as the latter is being trained only on few instances,
thus lacking full knowledge of the class distribution of all the instances available for training.</p>
      <p>[Figure 1: trends of accuracy percentage-increase for varying NLI; in each plot, curves correspond to 60%, 70%, 80%, and 90% fractions of unlabeled instances. Experiments were carried out on an Intel Core i7 CPU @2.90GHz, 32GB RAM, with an NVIDIA GeForce RTX 2070 Super GPU.]</p>
      <p>Concerning the baseline methods, two different situations occur between the Rnd plot
(Fig. 1(e)) and the LCS plot (Fig. 1(f)). The former shows a decreasing trend until mid values of
 (i.e., around 500) followed by a rising trend, which sheds light on the divergent behavior of
a random selection of the unlabeled instances w.r.t. all the other instance selection methods.
Also, the LCS plot shows curves that tend to monotonically decrease, resp. remain substantially
unchanged, for larger, resp. smaller, fractions of US, which again highlights how our
LAL-IGradV variants behave differently from an uncertainty sampling approach like LCS.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>We proposed a learning-to-active-learn approach whose key novelty is twofold: the integration
of a regression-based meta-learning approach within a maximum model-change framework,
and the definition of policies for scoring the instance importance based on the amount of change
in the learning gradient of a deep neural network model. Our experimental evaluation has
shown that our proposed LAL-IGradV outperforms both a random baseline and the LCS method,
especially when the number of initially available labeled instances gets smaller. As a future
work, we plan to evaluate the impact of measuring the importance of an instance not only in
terms of its own contribution to the model change but also w.r.t. other instances according to
some instance locality principle.
</p>
      <p>20th International Joint Conference on Artificial Intelligence, 2007, pp. 823–829.
[19] R. Moskovitch, N. Nissim, D. Stopel, C. Feher, R. Englert, Y. Elovici, Improving the detection of unknown computer worms activity using active learning, in: Proc. of the 30th Annual German Conference on Artificial Intelligence, volume 4667 of Lecture Notes in Computer Science, Springer, 2007, pp. 489–493.
[20] D. A. Cohn, Neural network exploration using optimal experiment design, Neural Networks 9 (1996) 1071–1083.
[21] A. J. Joshi, F. Porikli, N. P. Papanikolopoulos, Scalable active learning for multiclass image classification, IEEE Trans. Pattern Anal. Mach. Intell. 34 (2012) 2259–2273.
[22] H. T. Nguyen, A. W. M. Smeulders, Active learning using pre-clustering, in: Proc. of the Twenty-first International Conference on Machine Learning, 2004.
[23] Z. Xu, R. Akella, Y. Zhang, Incorporating diversity and density in active learning for relevance feedback, in: Proc. of the 29th European Conference on Information Retrieval, volume 4425 of Lecture Notes in Computer Science, Springer, 2007, pp. 246–257.
[24] S. Huang, R. Jin, Z. Zhou, Active learning by querying informative and representative examples, IEEE Trans. Pattern Anal. Mach. Intell. 36 (2014) 1936–1949.
[25] K. Konyushkova, R. Sznitman, P. Fua, Learning active learning from data, in: Proc. of the Annual Conference on Neural Information Processing Systems, 2017, pp. 4225–4235.
[26] A. Krizhevsky, G. Hinton, Learning multiple layers of features from tiny images (2009).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Settles</surname>
          </string-name>
          ,
          <article-title>Active Learning Literature Survey</article-title>
          ,
          <source>Technical Report</source>
          , University of Wisconsin-Madison, Department of Computer Sciences,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Baram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>El-Yaniv</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Luz</surname>
          </string-name>
          ,
          <article-title>Online choice of active learning algorithms</article-title>
          ,
          <source>J. Mach. Learn. Res</source>
          .
          <volume>5</volume>
          (
          <year>2004</year>
          )
          <fpage>255</fpage>
          -
          <lpage>291</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>W.</given-names>
            <surname>Hsu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Active learning by learning</article-title>
          ,
          <source>in: Proc. of the Twenty-Ninth AAAI Conference on Artificial Intelligence</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>2659</fpage>
          -
          <lpage>2665</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ebert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fritz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schiele</surname>
          </string-name>
          ,
          <article-title>RALF: A reinforced active learning formulation for object class recognition</article-title>
          ,
          <source>in: Proc. of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2012</year>
          , pp.
          <fpage>3626</fpage>
          -
          <lpage>3633</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Flesca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mandaglio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Scala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tagarelli</surname>
          </string-name>
          ,
          <article-title>Learning to active learn by gradient variation based on instance importance</article-title>
          ,
          <source>in: 2022 26th International Conference on Pattern Recognition (ICPR)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>2224</fpage>
          -
          <lpage>2230</lpage>
          . doi:10.1109/ICPR56361.2022.9956039.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D. D.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. A.</given-names>
            <surname>Gale</surname>
          </string-name>
          ,
          <article-title>A sequential algorithm for training text classifiers</article-title>
          ,
          <source>in: Proceedings of the 17th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          <year>1994</year>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D. D.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Catlett</surname>
          </string-name>
          ,
          <article-title>Heterogeneous uncertainty sampling for supervised learning</article-title>
          ,
          <source>in: Proc. of the Eleventh International Conference on Machine Learning</source>
          ,
          <year>1994</year>
          , pp.
          <fpage>148</fpage>
          -
          <lpage>156</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.</given-names>
            <surname>Settles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Craven</surname>
          </string-name>
          ,
          <article-title>An analysis of active learning strategies for sequence labeling tasks</article-title>
          ,
          <source>in: Proc. of the 2008 Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <year>2008</year>
          , pp.
          <fpage>1070</fpage>
          -
          <lpage>1079</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Scheffer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Decomain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wrobel</surname>
          </string-name>
          ,
          <article-title>Active hidden Markov models for information extraction</article-title>
          ,
          <source>in: Proc. of the 4th International Conference on Advances in Intelligent Data Analysis</source>
          , volume
          <volume>2189</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2001</year>
          , pp.
          <fpage>309</fpage>
          -
          <lpage>318</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H. S.</given-names>
            <surname>Seung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Opper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sompolinsky</surname>
          </string-name>
          ,
          <article-title>Query by committee</article-title>
          ,
          <source>in: Proc. of the Fifth Annual ACM Conference on Computational Learning Theory</source>
          ,
          <year>1992</year>
          , pp.
          <fpage>287</fpage>
          -
          <lpage>294</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>McCallum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Nigam</surname>
          </string-name>
          ,
          <article-title>Employing EM and pool-based active learning for text classification</article-title>
          ,
          <source>in: Proc. of the Fifteenth International Conference on Machine Learning</source>
          ,
          <year>1998</year>
          , pp.
          <fpage>350</fpage>
          -
          <lpage>358</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>I.</given-names>
            <surname>Dagan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. P.</given-names>
            <surname>Engelson</surname>
          </string-name>
          ,
          <article-title>Committee-based sampling for training probabilistic classifiers</article-title>
          ,
          <source>in: Proc. of the Twelfth International Conference on Machine Learning</source>
          ,
          <year>1995</year>
          , pp.
          <fpage>150</fpage>
          -
          <lpage>157</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P.</given-names>
            <surname>Melville</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Mooney</surname>
          </string-name>
          ,
          <article-title>Diverse ensembles for active learning</article-title>
          ,
          <source>in: Proc. of the Twenty-first International Conference on Machine Learning</source>
          ,
          <year>2004</year>
          . doi:10.1145/1015330.1015385.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>R.</given-names>
            <surname>Gilad-Bachrach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Navot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tishby</surname>
          </string-name>
          ,
          <article-title>Query by committee made real</article-title>
          ,
          <source>in: Proc. of the Neural Information Processing Systems</source>
          ,
          <year>2005</year>
          , pp.
          <fpage>443</fpage>
          -
          <lpage>450</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>B.</given-names>
            <surname>Settles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Craven</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ray</surname>
          </string-name>
          ,
          <article-title>Multiple-instance active learning</article-title>
          ,
          <source>in: Proc. of the Twenty-First Annual Conference on Neural Information Processing Systems</source>
          ,
          <year>2007</year>
          , pp.
          <fpage>1289</fpage>
          -
          <lpage>1296</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>W.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <article-title>Active learning for classification with maximum model change</article-title>
          ,
          <source>ACM Trans. Inf. Syst</source>
          .
          <volume>36</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>N.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>McCallum</surname>
          </string-name>
          ,
          <article-title>Toward optimal active learning through sampling estimation of error reduction</article-title>
          ,
          <source>in: Proc. of the Eighteenth International Conference on Machine Learning</source>
          ,
          <year>2001</year>
          , pp.
          <fpage>441</fpage>
          -
          <lpage>448</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Greiner</surname>
          </string-name>
          ,
          <article-title>Optimistic active-learning using mutual information</article-title>
          ,
          <source>in: Proc. of the</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>