<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Activist: A New Framework for Dataset Labelling</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Jack</forename><surname>O'Neill</surname></persName>
							<email>jack.oneill1@mydit.ie</email>
							<affiliation key="aff0">
								<orgName type="institution">Dublin Institute of Technology</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Sarah</forename><forename type="middle">Jane</forename><surname>Delany</surname></persName>
							<email>sarahjane.delany@dit.ie</email>
							<affiliation key="aff0">
								<orgName type="institution">Dublin Institute of Technology</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Brian</forename><surname>Mac Namee</surname></persName>
							<email>brian.macnamee@ucd.ie</email>
							<affiliation key="aff1">
								<orgName type="institution">University College Dublin</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Activist: A New Framework for Dataset Labelling</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">32BB2DBF8DE864053CCA84014574BB19</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T14:14+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Acquiring labels for large datasets can be a costly and time-consuming process. This has motivated the development of the semi-supervised learning problem domain, which makes use of unlabelled data, in conjunction with a small amount of labelled data, to infer the correct labels of a partially labelled dataset. Active learning is one of the most successful approaches to semi-supervised learning, and has been shown to reduce the cost and time taken to produce a fully labelled dataset. In this paper we present Activist: a free, online, state-of-the-art platform which leverages active learning techniques to improve the efficiency of dataset labelling. Using a simulated crowd-sourced label-gathering scenario on a number of datasets, we show that the Activist software can speed up, and ultimately reduce the cost of, label acquisition.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>The availability of a large corpus of labelled training data is a key component in developing effective machine learning models. In many cases, such as speech recognition systems and sentiment analysis, labels are time-consuming or expensive to obtain, and must be provided by human annotators, constituting a bottleneck in the predictive model development life-cycle. Recent trends have seen an increased interest in using crowd-sourcing platforms such as CrowdFlower<ref type="foot" target="#foot_0">3</ref> and Amazon Mechanical Turk<ref type="foot" target="#foot_1">4</ref> to distribute the task of dataset labelling over a large number of anonymous oracles <ref type="bibr" target="#b20">[21]</ref>. While crowd-sourced labels may reduce both the cost and time required to obtain a fully labelled dataset, further reductions may be realized by employing active learning to reduce the number of labels required.</p><p>The key insight behind active learning is that "a machine learning algorithm can perform better with less training if it is allowed to choose the data from which it learns" <ref type="bibr" target="#b15">[16]</ref>. By allowing the active learning system to select the most informative data, and pose queries for labels for this data to the label provider, or oracle, the cost and time required to train an effective machine learning model can be greatly reduced.</p><p>Although the actual utility of a label may not be known in advance, an active learning system may employ one or more heuristics to predict the utility of querying for a particular label. This decision-making process, or selection strategy, is a key component of the active learning process. An active learning system begins with a small amount of pre-labelled, or seed, data, and proceeds in iterations. Through its selection strategy, the system generates a query for a batch of labels from the unlabelled data. 
These labels are provided by the oracle, and the data is added to the labelled set. The process continues until a pre-determined stopping criterion is reached. A stopping criterion may be a straightforward label budget, or a more complex prediction of the marginal utility of each new label. Once this stopping criterion is met, a predictive model is trained using the set of labelled data. While active learning is primarily used in the context of predictive model generation, these same principles may be applied to a dataset labelling task. The process is carried out as above, but the resulting model is used to predict the labels of the remaining unlabelled data. The output of an active labelling task is, then, a fully labelled, approximately correct dataset.</p><p>A dataset labelling task may be seen as an instance of active learning in a pool-based setting, i.e. a setting in which the learner has access to a large, static pool of unlabelled instances from which to generate label requests. By submitting some, but not all, data to oracles for labelling, the goal of the active learning system in this context is to reduce the cost accrued and time spent per correct label acquired, while maintaining accuracy. This paper presents Activist, an extensible framework which assists users in all aspects of the data labelling process. As well as allowing users to configure an active labelling task, Activist provides a front-end UI through which labels are supplied to the active learning system. The system covers the dataset labelling process end-to-end, from loading and pre-processing the data to creating a fully labelled output dataset once the process is complete. 
In addition to assisting users in producing fully labelled datasets, Activist allows multiple active learning strategies to be compared on simulated dataset labelling tasks, creating a detailed performance analysis for each approach under examination.</p><p>In this paper we describe the Activist system, and show how it can be used in an evaluation investigating the cost-benefit of applying active learning to a number of dataset labelling tasks. We show that while the impact of active labelling varies depending on the task, an active labelling approach consistently outperforms full dataset labelling.</p><p>The rest of the paper is structured as follows: Section 2 discusses related research in the areas of active learning and cost-sensitive labelling; Section 3 describes the Activist framework, and how it can be used to support the active learning process; Section 4 evaluates the use of Activist on a number of datasets, exploring the cost-benefits of applying active learning to a dataset labelling task; finally, Section 5 discusses the findings, suggesting avenues for future research.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>This paper examines the use of active learning in a pool-based setting, in which the learner has access to a large, static pool of unlabelled instances from which to generate label requests. The problem of pool-based active learning was introduced by Lewis and Gale <ref type="bibr" target="#b9">[10]</ref> in response to the need to develop text classification models for document retrieval. One of the key components which differentiates approaches to active learning is the selection strategy: the heuristic used to predict the informativeness of a particular label. Initial approaches to selection strategies favoured some measure of uncertainty sampling <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4]</ref>, selecting those instances for labelling which are closest to the decision boundary of the model, i.e. 
those which the model is most likely to classify incorrectly.</p><p>An alternative selection strategy to uncertainty sampling is the Query-By-Committee (QBC) approach, introduced by Seung et al. <ref type="bibr" target="#b16">[17]</ref>. QBC describes a general approach in which a number of diverse classifiers are trained on the currently labelled data, such that the classifiers can be expected to produce slightly different results for each unseen instance. The learner then measures the level of disagreement between the classifiers for each unlabelled instance, and selects those instances which induce the highest level of disagreement within the committee. Variations on the QBC algorithm continue to be popular in the literature <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b10">11]</ref>.</p><p>Although measures of diversity have often been incorporated into other active learning selection strategies <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b7">8]</ref>, diversity was first proposed as the sole metric in a selection strategy by Baram et al. <ref type="bibr" target="#b0">[1]</ref>. Their Kernel-Farthest-First diversity algorithm seeks to label those instances which are least similar to the currently labelled data. Diversity, as a selection strategy, has been shown to work well in text classification <ref type="bibr" target="#b6">[7]</ref> and in regression problems <ref type="bibr" target="#b12">[13]</ref>.</p><p>Research has shown that the labelling of text classification datasets may be made more efficient by using visualisations to assist the annotator <ref type="bibr" target="#b18">[19]</ref>, or by using machine learning techniques to reduce the number of labels required of the annotator <ref type="bibr" target="#b13">[14,</ref><ref type="bibr" target="#b8">9]</ref>. 
Active learning has also been shown to improve the efficiency of dataset labelling for image classification <ref type="bibr" target="#b11">[12]</ref>, while the availability of commercial platforms such as CrowdFlower attests to the viability of active learning as a dataset labelling tool.</p><p>For a more in-depth discussion of the components comprising an active learning system (e.g. selection strategies, stopping criteria, etc.) see <ref type="bibr" target="#b15">[16,</ref><ref type="bibr" target="#b7">8]</ref>.</p></div>
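The batch-mode, pool-based active labelling loop described above can be sketched in a few lines of Python. This is an illustrative sketch only (function names and the committee interface are our own, not part of Activist), using vote entropy as a simple QBC disagreement measure:

```python
import numpy as np
from collections import Counter

def vote_entropy(votes):
    """Disagreement among committee votes for one instance (higher = more disagreement)."""
    counts = np.array(list(Counter(votes).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p + 1e-12)))

def active_labelling_loop(X, oracle, committee_factory, batch_size, budget, seed_idx):
    """Repeatedly query the oracle for the most disagreed-upon instances until the label budget is spent."""
    y = {i: oracle(i) for i in seed_idx}                      # seed labels from the oracle
    while len(y) < budget:                                    # stopping criterion: label budget
        committee = committee_factory([(X[i], y[i]) for i in y])
        pool = [i for i in range(len(X)) if i not in y]
        # rank unlabelled instances by committee disagreement, take the top batch
        batch = sorted(pool,
                       key=lambda i: vote_entropy([m(X[i]) for m in committee]),
                       reverse=True)[:batch_size]
        for i in batch:                                       # query the oracle for the batch
            y[i] = oracle(i)
    return y                                                  # labelled subset; a model trained on it labels the rest
```

In a real task the `committee_factory` would retrain diverse classifiers on the currently labelled data each iteration; here it is an abstract callable so the loop structure stays visible.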
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Activist</head><p>The Activist Framework provides an end-to-end solution for dataset labelling tasks. Using Activist, the dataset labelling process consists of four stages: loading, pre-processing, labelling and output. The life-cycle of an Activist task is illustrated in Figure <ref type="figure" target="#fig_0">1</ref>.</p><p>The simplest data format understood by Activist is the comma-separated values (csv) file, though richer formats are needed for many real-world problems such as image or document classification. When comparing strategies, the system runs against a fully labelled dataset. Labels are hidden from the system until requested. After each batch of label requests is issued, the chosen predictive model is trained and used to predict the labels of the remaining data. Accuracy and execution times are recorded and returned to the researcher as a csv file when the process is complete, allowing for direct comparison of multiple approaches.</p></div>
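The simulated-labelling workflow above can be pictured with a short sketch. All names here are hypothetical stand-ins (the paper does not show Activist's implementation): true labels are hidden, revealed batch by batch on request, and accuracy is logged to a csv file after each batch.

```python
import csv

def run_simulation(X, true_labels, select, fit_predict, batch_size, out_path):
    """Hide the true labels, reveal them batch by batch, and log accuracy per batch to a csv file."""
    revealed = {}                                    # index -> label revealed by the simulated oracle
    pool = list(range(len(X)))
    rows = [("labels_requested", "accuracy")]
    while pool:
        for i in select(pool, revealed, X)[:batch_size]:
            revealed[i] = true_labels[i]             # simulated oracle call: label becomes visible
            pool.remove(i)
        # accuracy = (labels from oracle + correctly predicted labels) / dataset size
        preds = fit_predict(X, revealed, pool)       # train on revealed labels, predict the rest
        correct = sum(preds[i] == true_labels[i] for i in pool)
        rows.append((len(revealed), (len(revealed) + correct) / len(X)))
    with open(out_path, "w", newline="") as f:
        csv.writer(f).writerows(rows)
```

The `select` and `fit_predict` callables stand in for the configured selection strategy and predictive model; swapping them is what makes direct comparison of approaches possible.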
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Evaluation</head><p>The aim of the evaluation is to explore the potential of the Activist framework to reduce the number of manually provided labels needed to produce a fully labelled dataset. This section describes the data and methodology used in the experiment, and reports the findings.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Datasets Used</head><p>Three datasets were used in this experiment: the MNIST handwriting recognition dataset, the CIFAR-10 image classification dataset and the 20 Newsgroups document classification dataset. The MNIST dataset<ref type="foot" target="#foot_2">5</ref> consists of 50,000 28x28 pixel gray-scale images of hand-written digits between 0 and 9. Each image is represented as a pixel map containing the value of each pixel as an unsigned byte. Another image classification dataset, CIFAR-10<ref type="foot" target="#foot_3">6</ref>, consists of 60,000 32x32 colour images in 10 equally distributed classes indicating the content of the image, all subcategories of vehicles and animals: airplane, automobile, bird, cat, dog, etc. Images are represented as a pixel map containing RGB values for each pixel as unsigned bytes. Rather than using the raw pixel values directly, individual pixels were aggregated into row and column totals for each colour channel, resulting in a vector of 192 features. The 20 Newsgroups<ref type="foot" target="#foot_4">7</ref> dataset is a freely available document classification dataset, consisting of approximately 20,000 documents partitioned approximately evenly across 20 different newsgroups. Each document was represented as a bag of words. The data was stemmed; stop words and words occurring in fewer than 3 separate documents were removed as part of the data pre-processing stage. In order to reduce the dataset size and problem complexity, a subset of the data containing 5 of the 20 newsgroups (alt.atheism, comp.windows.x, rec.autos, sci.space, talk.politics.guns) was chosen.</p></div>
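The CIFAR-10 feature aggregation described above can be sketched directly (the function name is ours; the arithmetic follows the text: 32 row totals + 32 column totals, per colour channel, gives 192 features):

```python
import numpy as np

def aggregate_cifar_image(img):
    """(32, 32, 3) RGB pixel array -> 192-element feature vector of per-channel row/column totals."""
    row_totals = img.sum(axis=1)   # shape (32, 3): one total per row, per colour channel
    col_totals = img.sum(axis=0)   # shape (32, 3): one total per column, per colour channel
    return np.concatenate([row_totals.ravel(), col_totals.ravel()])  # 32*3 + 32*3 = 192
```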
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Experimental Methodology</head><p>The active learning approach used in these experiments was set up using the Activist framework. As part of the task configuration, choices need to be made for the active learning components used in the task: seed data, a batch size, a selection strategy, a stopping criterion and a predictive model algorithm. The following configuration was used for each of the datasets under consideration.</p><p>Seed Data: 50 initial labels were randomly selected and provided to the active learning system as seed data. Batch Size: To keep the batch sizes roughly proportional to the size of the datasets, the MNIST dataset used a batch size of 10, while the CIFAR-10 and 20 Newsgroups datasets were evaluated with a batch size of 50.</p><p>Stopping Criterion: The active learning loop was run until no unlabelled data remained, with performance recorded after each batch was complete.</p><p>Selection Strategies: A Query-by-Committee algorithm was used, with a committee of 5 k-nearest neighbour models (k=5), each committee member trained on a subset consisting of 80% of the data, selected randomly with replacement. An alternative, diversity-based selection strategy was also employed, using cosine distance as its distance metric. Finally, a random selection strategy, which makes no effort to select the best labels for querying, was evaluated as a baseline.</p><p>Predictive Model: A k-nearest neighbour predictive model with k=5 was used to classify the remaining unlabelled data after each iteration.</p><p>After each new batch of labels was added to the labelled dataset, a predictive model was trained using the currently labelled data, and used to predict the labels of the remaining unlabelled data. The number of correct labels (labels provided by the oracle + correctly predicted labels) was recorded at each step.</p></div>
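The diversity-based selection strategy with cosine distance can be sketched as a greedy farthest-first pass in the spirit of Kernel-Farthest-First <ref type="bibr" target="#b0">[1]</ref>. This is an illustrative sketch under our own function names, not Activist's exact implementation:

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity; treat zero vectors as maximally distant."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return 1.0 - float(np.dot(a, b)) / denom if denom else 1.0

def diversity_batch(X, labelled_idx, pool_idx, batch_size):
    """Greedily pick the pool instances farthest (by cosine distance) from anything labelled or already chosen."""
    chosen, anchors = [], list(labelled_idx)
    for _ in range(batch_size):
        candidates = [i for i in pool_idx if i not in chosen]
        # a candidate's score is its distance to the *nearest* anchor; pick the farthest such candidate
        best = max(candidates,
                   key=lambda i: min(cosine_distance(X[i], X[j]) for j in anchors))
        chosen.append(best)
        anchors.append(best)       # future picks must also differ from this one
    return chosen
```

Appending each pick to the anchor set is what keeps a batch internally diverse, rather than selecting `batch_size` near-duplicates of the single farthest point.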
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Findings</head><p>Figure <ref type="figure" target="#fig_1">3</ref> shows the results of the experiment. After each batch of labels was requested, a predictive model was trained using the currently labelled data, and used to predict labels for the as-yet unlabelled data. The overall accuracy is recorded on the y-axis, while the number of labels provided by the oracle is recorded on the x-axis. The black dashed line represents the accuracy obtained in the absence of an active labelling system, i.e. where the only correct labels are those provided by the oracle. The difference on the y-axis between the dashed and solid lines represents the accuracy gain provided by the active labelling framework.</p><p>The MNIST dataset demonstrates that Activist can significantly improve the labelling rate of some datasets. Although the effect is less pronounced, the CIFAR-10 and 20 Newsgroups datasets also benefit from employing active labelling techniques. These results show that the benefit gained from active labelling depends on the characteristics of the dataset being used and the related prediction problem. The results also show that in all cases, even a random selection strategy yields demonstrable performance benefits over manual labelling, represented by the x=y baseline. This indicates that, although the performance of the Activist system differs depending on the selection strategy chosen, applying active learning techniques to dataset labelling yields a visible performance improvement irrespective of the particular selection strategy used.</p></div>
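The quantity plotted on the y-axis counts both kinds of correct label. As a one-function sketch (the helper name is ours):

```python
def correct_label_count(oracle_labelled, predictions, truth):
    """Correct labels = labels supplied directly by the oracle + correctly predicted labels.

    oracle_labelled: indices labelled by the oracle; predictions/truth: mappings for the remaining instances.
    """
    predicted_correct = sum(predictions[i] == truth[i] for i in predictions)
    return len(oracle_labelled) + predicted_correct
```

Dividing this count by the dataset size gives the accuracy curve; on the x=y baseline, `predictions` contributes nothing and only the oracle-supplied labels count.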
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusions and Future Work</head><p>This paper presented Activist, a platform for applying active learning techniques to the problem of dataset labelling. Activist reduces the amount of manual dataset labelling required to produce a fully labelled, approximately correct dataset. The Activist platform is under active development and is available for download online<ref type="foot" target="#foot_5">8</ref>.</p><p>This evaluation has demonstrated the potential benefits of applying active learning to dataset labelling. Future work will expand the capabilities of the framework to further facilitate labelling large datasets. In order to take advantage of the benefits of crowd-sourced labelling, future work will incorporate an API to allow users to obtain labels from on-line crowdsourcing platforms.</p><p>The Activist framework will be expanded to include a wider variety of active learning components, particularly predictive models. Convolutional neural networks have been shown to be effective at classifying the CIFAR-10 dataset <ref type="bibr" target="#b17">[18,</ref><ref type="bibr" target="#b5">6]</ref>, while SVMs have been shown to work well classifying the 20 Newsgroups dataset <ref type="bibr" target="#b14">[15]</ref>. The inclusion of a wider range of predictive models is anticipated to yield a greater benefit for a larger number of datasets.</p><p>In its current format, the Activist system relies on a single label per instance. This approach is known to be problematic due to errors or subjectivity in the labelling process. Strategies for coping with this problem have been discussed in further detail by Tarasov <ref type="bibr" target="#b19">[20]</ref>. 
Future work will aim to allow the Activist system to handle multiple responses per instance in an effort to mitigate the impact of subjectivity and rater unreliability on the labelling process.</p><p>The experiment has shown that the performance of active labelling depends to some extent on the selection strategies used. This suggests that a deeper investigation of the relative impact of all active learning components may prove promising. In addition to adding a wider range of components to the Activist platform, we hope to develop heuristics which will guide users in tailoring an active learning task to the problem at hand.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. Flow diagram illustrating the life-cycle of an Activist task</figDesc><graphic coords="4,145.72,121.61,323.92,156.13" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig. 3 .</head><label>3</label><figDesc>Fig. 3. Graphs showing the accuracy achieved per labels requested on each of the datasets examined. The dashed black line represents the number of correct labels in the absence of an active labelling system.</figDesc><graphic coords="7,309.73,521.81,172.91,96.53" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0"><head></head><label></label><figDesc></figDesc><graphic coords="5,145.72,121.61,323.91,175.75" type="bitmap" /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_0">https://www.crowdflower.com</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_1">https://www.mturk.com/mturk/welcome</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_2">http://yann.lecun.com/exdb/mnist/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_3">https://www.cs.toronto.edu/~kriz/cifar.html</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_4">http://qwone.com/~jason/20Newsgroups/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_5">https://github.com/joneill87/Activist</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Online choice of active learning algorithms</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Baram</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>El-Yaniv</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Luz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">The Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="page" from="255" to="291" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Incorporating diversity in active learning with support vector machines</title>
		<author>
			<persName><forename type="first">K</forename><surname>Brinker</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ICML</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="59" to="66" />
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Active learning with statistical models</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">A</forename><surname>Cohn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Ghahramani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">I</forename><surname>Jordan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of artificial intelligence research</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="page" from="129" to="145" />
			<date type="published" when="1996">1996</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Minimizing manual annotation cost in supervised training from corpora</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">P</forename><surname>Engelson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Dagan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 34th annual meeting on Association for Computational Linguistics</title>
				<meeting>the 34th annual meeting on Association for Computational Linguistics</meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="1996">1996</date>
			<biblScope unit="page" from="319" to="326" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Query by committee made real</title>
		<author>
			<persName><forename type="first">R</forename><surname>Gilad-Bachrach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Navot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Tishby</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in neural information processing systems</title>
				<imprint>
			<date type="published" when="2005">2005</date>
			<biblScope unit="page" from="443" to="450" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Spatially-sparse convolutional neural networks</title>
		<author>
			<persName><forename type="first">B</forename><surname>Graham</surname></persName>
		</author>
		<idno>CoRR abs/1409.6070</idno>
		<ptr target="http://arxiv.org/abs/1409.6070" />
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">Active learning for text classification</title>
		<author>
			<persName><forename type="first">R</forename><surname>Hu</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2011">2011</date>
		</imprint>
		<respStmt>
			<orgName>Dublin Institute of Technology</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Egal: Exploration guided active learning for tcbr</title>
		<author>
			<persName><forename type="first">R</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">J</forename><surname>Delany</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mac Namee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Case-Based Reasoning</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="156" to="170" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">Sweetening the dataset: Using active learning to label unlabelled datasets</title>
		<author>
			<persName><forename type="first">R</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mac Namee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">J</forename><surname>Delany</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">A sequential algorithm for training text classifiers</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">D</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">A</forename><surname>Gale</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval</title>
				<meeting>the 17th annual international ACM SIGIR conference on Research and development in information retrieval</meeting>
		<imprint>
			<publisher>Springer-Verlag New York, Inc</publisher>
			<date type="published" when="1994">1994</date>
			<biblScope unit="page" from="3" to="12" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Active learning for cross-domain sentiment classification</title>
		<author>
			<persName><forename type="first">S</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xue</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Zhou</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IJCAI</title>
				<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Multilabel svm active learning for image classification</title>
		<author>
			<persName><forename type="first">X</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Sung</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICIP&apos;04. 2004 International Conference on</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2004">2004. 2004</date>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="page" from="2207" to="2210" />
		</imprint>
	</monogr>
	<note>Image Processing</note>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title level="m" type="main">An evaluation of selection strategies for active learning with regression</title>
		<author>
			<persName><forename type="first">J</forename><surname>O'neill</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">Semi-automated annotation and active learning for language documentation</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M</forename><surname>Palmer</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Less is more: Active learning with support vector machines</title>
		<author>
			<persName><forename type="first">G</forename><surname>Schohn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Cohn</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICML</title>
				<imprint>
			<publisher>Citeseer</publisher>
			<date type="published" when="2000">2000</date>
			<biblScope unit="page" from="839" to="846" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Active learning</title>
		<author>
			<persName><forename type="first">B</forename><surname>Settles</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Synthesis Lectures on Artificial Intelligence and Machine Learning</title>
				<imprint>
			<date type="published" when="2012">2012</date>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="page" from="1" to="114" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Query by committee</title>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">S</forename><surname>Seung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Opper</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Sompolinsky</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the fifth annual workshop on Computational learning theory</title>
				<meeting>the fifth annual workshop on Computational learning theory</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="1992">1992</date>
			<biblScope unit="page" from="287" to="294" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">T</forename><surname>Springenberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Dosovitskiy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Brox</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Riedmiller</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1412.6806</idno>
		<title level="m">Striving for simplicity: The all convolutional net</title>
				<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Articulate: A semi-automated model for translating natural language queries into meaningful visualizations</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Leigh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Johnson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Lee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Symposium on Smart Graphics</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="184" to="195" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<title level="m" type="main">Dynamic estimation of rater reliability using multi-armed bandits</title>
		<author>
			<persName><forename type="first">A</forename><surname>Tarasov</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">A survey of crowdsourcing systems</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">C</forename><surname>Yuen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>King</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">S</forename><surname>Leung</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Privacy, Security, Risk and Trust (PASSAT) and 2011 IEEE Third International Conference on Social Computing (SocialCom)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2011">2011. 2011</date>
			<biblScope unit="page" from="766" to="773" />
		</imprint>
	</monogr>
	<note>IEEE Third International Conference on</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
