Active Stream Learning with an Oracle of Unknown Availability for Sentiment Prediction

Elson Serrao and Myra Spiliopoulou

Otto-von-Guericke-University Magdeburg, Germany
elson.serrao@gmail.com, myra@ovgu.de

Abstract. Active learning holds the promise of learning models from data with minimal expert input. However, it assumes that the expert is either always available or available only at the beginning. We waive this assumption and investigate to what extent active learning is effective in practice. We focus on sentiment classification over real streams of opinions. We show that, at least for the two real streams we have analyzed, the random strategy is very competitive, and querying the expert in an intelligent way brings few advantages, at least when the expert is irregularly available.

Keywords: active learning, oracle availability, polarity model learning, opinion stream mining

1 Introduction

The objective of active learning is to obtain performance better than or comparable to that of a fully supervised learner with fewer labels, provided the learner is given the opportunity to select the instances for which it requires labels [1]. Active learning is thus well suited to scenarios where unlabeled data is abundant and obtaining new labels is expensive. Labels are obtained from an oracle, who can be, for example, a domain expert or a human annotator from a crowd-sourcing platform. However, it is often assumed that there is a single oracle that is always correct, always available and inexpensive to query. While there are surveys [1], [2], [3] and studies [4], [5] that provide insights into the above-mentioned challenges in active learning, only a few studies focus on the availability of the oracle for streaming data [6].

In this paper, we consider a stream of opinionated documents and try to predict the sentiment of each document as positive, negative or neutral.
Over time, drift may be observed in the stream due to evolving topics, data and vocabulary, requiring the classifier to adapt to the opinionated stream. For this, we use active learning to obtain new labels from the data stream. Instances to be labeled by the oracle are sampled using an appropriate query strategy. However, we assume that the oracle is available irregularly, i.e. according to a pattern unknown to the learner. This implies that the oracle may be queried at any moment, but will respond by delivering the label only if it is available. If the oracle is unavailable, the instance is not used. This workflow is shown in Figure 1.

Fig. 1. Interaction of the stream learner and an irregularly available oracle

The remainder of the paper is organized as follows. Section 2 discusses the related work. In Section 3 we detail our framework and the active learning query strategies used. Section 4 describes the setup for our experiments, and Section 5 discusses the results of those experiments. We present our concluding remarks in Section 6.

2 Related Work

Most of the algorithms for active learning on streams either assume infinite verification latency, whereupon they invoke semi-supervised learning [7], or assume that the oracle is always available to provide labels [8], [9], [10], [11], [12].

Shickel and Rashidi [6] propose a framework that is aware of the oracle's availability for data streams. Their framework considers multiple oracles and focuses on querying first those oracles that have a higher availability. They try to achieve a cost-benefit tradeoff by using a dynamic labeling budget proportional to the oracle's availability, with the cost of labeling an instance inversely proportional to the oracle's availability.
However, such a cost-benefit tradeoff seems unrealistic in real-world scenarios, where the cost is likely to be governed by the difficulty of obtaining the label and by other factors [5]. No experiments were conducted with oracles of varying expertise, although varying expertise is often considered in the active learning literature on multiple oracles.

3 Active Learning on an Opinionated Stream

We consider a data stream $D$ of opinionated documents observed at distinct timepoints $t_0, t_1, \ldots, t_i, \ldots$, where at each timepoint $t_i$ we receive a batch of documents. We define timepoints at a chosen temporal granularity; for example, a timepoint could correspond to a week. Consequently, all the documents arriving during the time period from $t_{i-1}$ to $t_i$ comprise the batch of documents for $t_i$.

Our framework encompasses an algorithmic core, described in Section 3.1, and the query strategies, described in Section 3.2. We link this framework with a simulator of the oracle's availability, described in Section 4.1, and with an algorithm for the preparation of the opinionated data stream, described in Section 4.2.

3.1 Modeling Oracle Unavailability during Querying

We consider the beginning of the stream at $t_0$ to be characterized by the availability of an initial set of labeled documents $L_0$, which is used to initialize the classifier $\Delta$. At subsequent timepoints, i.e. from $t_1$ onwards, we receive unlabeled data $U_t$. If the budget $B$ is not exceeded, then for every unlabeled document $x$ we use the trained classifier $\Delta$ to predict the probability $P(\hat{y}_c \mid x)\ \forall c \in C$, where $c$ represents the sentiment of the document, namely positive, negative or neutral. Our method calculates the confidence of the learner's prediction $I$ using metric $\phi$, and launches a request for the true label $y$ when necessary. The oracle provides the true label $y$ only if it is available, in which case the document $x$ is added to the labeled data for the next iteration.
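The querying step just described can be illustrated with a minimal Python sketch. It is not the authors' Algorithm 1: the function names, the fixed uncertainty threshold, the choice of least confidence as the metric, and the convention that the budget is counted in queries sent (whether or not they are answered) are all assumptions made for illustration.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Label = str
# Hypothetical oracle interface: returns the true label of document `idx`,
# or None when the oracle happens to be unavailable.
Oracle = Callable[[int], Optional[Label]]


def least_confidence(probs: Sequence[float]) -> float:
    """Uncertainty as 1 - max class probability (one possible metric)."""
    return 1.0 - max(probs)


def query_batch(
    probas: List[Sequence[float]],
    oracle: Oracle,
    budget: int,
    threshold: float,
) -> Tuple[List[Tuple[int, Label]], int]:
    """Process one batch of predicted class probabilities.

    Returns the (index, label) pairs obtained from the oracle and the
    number of queries sent.
    """
    labeled: List[Tuple[int, Label]] = []
    queries_sent = 0
    for idx, probs in enumerate(probas):
        if queries_sent >= budget:
            break  # assumed rule: stop querying once the budget is used up
        if least_confidence(probs) < threshold:
            continue  # learner is confident enough; no label requested
        queries_sent += 1  # assumption: a query counts even if unanswered
        label = oracle(idx)
        if label is not None:
            labeled.append((idx, label))  # kept for the next iteration
        # Otherwise the oracle was unavailable and the document is not used.
    return labeled, queries_sent
```

In this sketch the learner cannot tell in advance whether a query will be answered, so an unanswered query still consumes budget; a different accounting rule would change only the line that increments `queries_sent`.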
In the event that the oracle is unavailable, x is not used to adapt the learner. An overview of our framework is shown in Algorithm 1. We assume the cost of labeling is the same for every document at any time t. If nt is the number of queries sent to the oracle at t, then the utilized budget at t is given by nt