=Paper= {{Paper |id=Vol-2192/ialatecml_paper2 |storemode=property |title=Active Stream Learning with an Oracle of Unknown Availability for Sentiment Prediction |pdfUrl=https://ceur-ws.org/Vol-2192/ialatecml_paper2.pdf |volume=Vol-2192 |authors=Elson Serrao,Myra Spiliopoulou |dblpUrl=https://dblp.org/rec/conf/pkdd/SerraoS18 }} ==Active Stream Learning with an Oracle of Unknown Availability for Sentiment Prediction== https://ceur-ws.org/Vol-2192/ialatecml_paper2.pdf
   Active Stream Learning with an Oracle of
 Unknown Availability for Sentiment Prediction

                       Elson Serrao and Myra Spiliopoulou

                Otto-von-Guericke-University Magdeburg, Germany
                     elson.serrao@gmail.com, myra@ovgu.de



      Abstract. Active learning holds the promise of learning models from
      data with minimal expert input. However, it typically assumes that the
      expert is either always available or available only at the beginning. We
      waive this assumption and investigate to what extent active learning is
      effective in practice. We focus on sentiment classification over real
      streams of opinions. We show that, at least for the two real streams we
      have analyzed, the random strategy is very competitive, and querying the
      expert in an intelligent way brings few advantages when the expert is
      irregularly available.

      Keywords: active learning, oracle availability, polarity model learning,
      opinion stream mining


1   Introduction

The objective of active learning is to obtain better or comparable performance
to a fully supervised learner with fewer labels if the learner is given the oppor-
tunity to select the instances for which it requires labels [1]. Active learning is
thus very suitable in those scenarios where there is an abundance of unlabeled
data and obtaining new labels is rather expensive. Labels are obtained using an
oracle who can, for example, be a domain expert or a human annotator from
a crowd-sourcing platform. However, it is often assumed that there is a single
oracle that is always correct, always available and inexpensive to query. While
there are surveys [1], [2], [3] and studies [4], [5] that provide insights into
the above-mentioned challenges in active learning, only a few studies focus on
the availability of the oracle for streaming data [6].
    In this paper, we consider a stream of opinionated documents and try to
predict the sentiment of each document as positive, negative or neutral. Over
time, drift may be observed in the stream due to evolving topics, data and
vocabulary, requiring the classifier to adapt to the opinionated stream. For
this, we use active learning to obtain new labels from the data stream.
Instances to be labeled by the oracle are sampled using an appropriate query
strategy. However, we assume that the oracle is available irregularly, i.e.,
according to a pattern unknown to the learner. This implies that the oracle
may be queried at any moment, but will deliver the label only if it is available.






If the oracle is unavailable, the instance is not used. This workflow is shown in
Figure 1.

     Fig. 1. Interaction of the stream learner and an irregularly available oracle

    The remainder of the paper is organized as follows. Section 2 discusses the
related work. In Section 3 we detail our framework and the active learning query
strategies used. Section 4 describes the setup for our experiments, and
Section 5 discusses the results of those experiments. We present our concluding
remarks in Section 6.


2    Related Work
Most of the algorithms for active learning on streams either assume infinite
verification latency, whereupon they invoke semi-supervised learning [7], or
assume that the oracle is always available to provide labels [8], [9], [10],
[11], [12].
    Shickel and Rashidi in [6] propose a framework that is aware of the oracle’s
availability for data streams. Their framework considers multiple oracles and fo-
cuses on querying first those oracles that have a higher availability. They try to
achieve a cost-benefit tradeoff by using a dynamic labeling budget proportional
to the oracle’s availability with the cost of labeling an instance inversely pro-
portional to the oracle’s availability. However, such a cost-benefit tradeoff seems
unrealistic in real-world scenarios where the cost is likely to be regulated by the
difficulty in obtaining the label and other factors [5]. No experiments were con-
ducted with oracles of varying expertise, which is often seen in active learning
literature considering multiple oracles.


3    Active Learning on an Opinionated Stream
We consider a data stream D of opinionated documents observed at distinct
timepoints t0, t1, ..., ti, ..., where at each timepoint ti we receive a batch of
documents. Timepoints are defined at a chosen temporal granularity; for example,
consecutive timepoints could be one week apart. Consequently, all the documents
arriving during the time period from ti−1 to ti comprise the batch of documents
for ti.
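
The batching described above can be illustrated with a minimal sketch. It assumes each document carries a timestamp and that timepoints are spaced one period apart, starting at t0. The function name `batch_by_period`, the `(timestamp, text)` tuple layout, and the rule that documents arriving at or before t0 go into batch 0 are illustrative assumptions, not part of the paper.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def batch_by_period(docs, start, period=timedelta(weeks=1)):
    """Group (timestamp, text) pairs into batches indexed by timepoint i.

    Batch i collects the documents that arrive after t_{i-1} and up to
    (and including) t_i, where t_i = start + i * period. Documents at or
    before `start` are placed in batch 0 (an assumption of this sketch).
    """
    batches = defaultdict(list)
    for ts, text in docs:
        if ts <= start:
            i = 0
        else:
            # Ceiling division of the elapsed time by the period length.
            i = -((start - ts) // period)
        batches[i].append(text)
    return dict(batches)


start = datetime(2018, 1, 1)
docs = [
    (datetime(2018, 1, 1), "a"),   # arrives at t0
    (datetime(2018, 1, 4), "b"),   # within the first week
    (datetime(2018, 1, 8), "c"),   # exactly at t1
    (datetime(2018, 1, 9), "d"),   # within the second week
]
batches = batch_by_period(docs, start)
```

With weekly timepoints, documents "b" and "c" fall into the batch for t1 and document "d" into the batch for t2.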



   Our framework encompasses an algorithmic core described in Section 3.1 and
the query strategies in Section 3.2. We link this framework with a simulator of
the oracle’s availability, described in Section 4.1, and with an algorithm for the
preparation of the opinionated data stream, described in Section 4.2.

3.1   Modeling Oracle Unavailability during Querying
We consider the beginning of the stream at t0 to be characterized by the
availability of an initial set of labeled documents L0, which is used to
initialize the classifier ∆. At subsequent timepoints, i.e., from t1 onwards,
we receive unlabeled data Ut. As long as the budget B is not exceeded, for every
unlabeled document x, we use the trained classifier ∆ to predict the probability
P(ŷc | x) ∀c ∈ C, where c represents the sentiment of the document, namely
positive, negative or neutral. Our method calculates the confidence of the
learner’s prediction I using metric φ, and launches a request for the true
label y when necessary. The oracle provides the true label y only if it is
available, in which case the document x is added to the labeled data for the
next iteration. If the oracle is unavailable, x is not used to adapt the
learner. An overview of our framework is shown in Algorithm 1.
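
The querying step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reproduction of Algorithm 1: the class `IrregularOracle`, the function `process_batch`, the fixed confidence threshold, the choice of the largest class posterior as the metric φ, and the reading of the budget as a maximum fraction of the batch that may be queried are all hypothetical. A query counts towards the budget once it is sent, whether or not the oracle answers.

```python
import random

CLASSES = ("positive", "negative", "neutral")


class IrregularOracle:
    """Hypothetical oracle that answers a query only when it is available."""

    def __init__(self, true_labels, availability=0.5, seed=0):
        self.true_labels = true_labels
        self.availability = availability  # assumed probability of answering
        self.rng = random.Random(seed)

    def query(self, doc_id):
        # Return the true label, or None when the oracle is unavailable.
        if self.rng.random() < self.availability:
            return self.true_labels[doc_id]
        return None


def confidence(probs):
    # Assumed stand-in for the metric phi: the largest class posterior.
    return max(probs.values())


def process_batch(batch, predict_proba, oracle, budget, threshold=0.6):
    """Query the oracle for low-confidence documents until the budget is spent.

    batch: list of (doc_id, text) pairs received at one timepoint.
    predict_proba: callable mapping a text to {class: probability}.
    budget: assumed here to be the maximum fraction of the batch to query.
    Returns the newly labeled (text, label) pairs and the number of queries sent.
    """
    max_queries = int(budget * len(batch))
    newly_labeled, n_queries = [], 0
    for doc_id, text in batch:
        if n_queries >= max_queries:
            break
        if confidence(predict_proba(text)) < threshold:
            n_queries += 1  # a sent query uses budget even if unanswered
            label = oracle.query(doc_id)
            if label is not None:
                newly_labeled.append((text, label))
            # If the oracle is unavailable, the document is simply not used.
    return newly_labeled, n_queries


def uniform_classifier(text):
    # Toy classifier that is maximally uncertain about every document.
    return {c: 1.0 / len(CLASSES) for c in CLASSES}


batch = [(0, "good"), (1, "bad"), (2, "meh"), (3, "ok")]
truth = {0: "positive", 1: "negative", 2: "neutral", 3: "neutral"}

always = IrregularOracle(truth, availability=1.0)
never = IrregularOracle(truth, availability=0.0)
labeled_always, sent_always = process_batch(batch, uniform_classifier, always, budget=0.5)
labeled_never, sent_never = process_batch(batch, uniform_classifier, never, budget=0.5)
```

In both runs two queries are sent, but only the always-available oracle contributes labeled documents for the next iteration; the never-available oracle spends the same budget and yields nothing.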
     We assume the cost of labeling is the same for every document at any time
t. If nt is the number of queries sent to the oracle at t, then the utilized budget
at t is given by                        nt