Online Prediction Threshold Optimization Under Semi-deferred
                         Labelling
                         Yorick Spenrath1,* , Marwan Hassani1 and Boudewijn F. van Dongen1
                         1
                             Process Analytics Group, Faculty of Mathematics and Computer Science, Eindhoven University of Technology, The Netherlands


                                             Abstract
                                             In supermarket loyalty campaigns, shoppers collect stamps to redeem limited-time luxury products. Having an accurate prediction
                                             of which shoppers will eventually redeem is crucial to effective execution. With the ultimate goal of changing shopper behavior, it
                                             is important to ensure an adequate number of rewards and to be able to steer promising shoppers into joining the campaign and
                                             redeeming a reward. If information from previous campaigns is available, a prediction model can be built to predict the redemption
                                             probability, possibly also adapting the prediction threshold to determine predicted the label. During a running campaign, we only know
                                             a subset of the labels of the positive class (the so-far redeemers), and have no access to the labels of any example of the negative class
                                             (non-redeemers at the end of the campaign). The majority of the examples during the campaign do not have a label yet (shoppers that
                                             could still redeem but have not done so yet). This is a semi-deferred labelling setting and our goal is to improve the prediction quality
                                             using this limited information. Existing work on predicting (semi-deferred) labels either focuses on positive-unlabelled learning, which
                                             does not use existing models, or updates models after the prediction is made by assigning expected labels using unsupervised learning
                                             models. In this paper we present a framework for Online Prediction threshold optimization Under Semi-deferred labelling (OPUS). Our
                                             framework does not change the existing model, but instead adapts the prediction threshold that decides which probability is required
                                             for a positive label, based on the semi-deferred labels we already know. We apply OPUS to two real-world datasets: a supermarket with
                                             two campaigns and over 160 000 shoppers.

                                             Keywords
                                             Prediction threshold, Online, Semi-deferred labels, Supermarket loyalty campaign


                         1. Introduction                                                                                                                   Known at t                     Unknown at t
                                                                                                                                                 Negative example Unknown example                 Positive example
                         Traditional supervised machine learning projects start with
                         a set of labelled data. Specifically in binary classification,
                         data belongs to one of two classes: positive and negative.
                         Starting from a set of examples with their ground truth label,                                                          I                II                III         IV
                         a predictor is created, which can assess the probability that                                                                          0             t              te

                         an unseen example belongs to the positive class. Without                                                               Offline training Only + labels More + labels         All labels
                         optimization, an example is predicted to be part of the posi-                                                       Figure 1: I) In an offline phase, all labels are present. II) In the
                         tive class if this probability is at least 0.5. One improvement                                                     online phase, this data is potentially outdated, but we do know a
                         is to select a different prediction threshold that distinguishes                                                    limited number of positive labels. III) The set of positive labels
                         examples between the positive and negative class on their                                                           grows as time progresses. IV) At time 𝑡𝑒 , all labels of the online
                         probability. In the presence of labelled data, such a threshold                                                     phase are known. At time 𝑡 we can only use the offline data (I)
                                                                                                                                             and the positive and unlabelled examples (II).
                         can be picked in a way that the predicted labels maximize
                         a given metric. Provided that no concept drift occurs, the
                         model with the adapted threshold can be used indefinitely
                         without degrading prediction quality. In this scenario, we                                                          If the latency becomes too high, models can no longer prop-
                         only use the ‘offline training’ (I) of Figure 1.                                                                    erly be corrected for concept drift, or not even be updated
                            This assumption on the data is however not realistic since                                                       at all if the latency is infinite [3].
                         changes in the data distribution such as concept drift can                                                              In this paper we consider a special class of the latter,
                         reduce the quality of the model. A common technique is to                                                           where we know for some examples that they are positive
                         retrain or update the model at a later point in time, when                                                          and do not know the label of the other examples (they may
                         more recent labelled data is available [1] and potentially                                                          be either positive or negative). This is sketched in Figure 1.
                         also updating the prediction threshold. This training is re-                                                        At time 𝑡, we have the (potentially outdated) offline training
                         ferred to as online training, where new labelled examples                                                           data. We have further seen new examples, of which we
                         are received and can be used to update the model and/or                                                             know some are positive. It is however not until time 𝑡𝑒 that
                         prediction threshold. Such techniques assume that the label                                                         we know the actual labels of all examples, and up to that
                         latency, that is the time between seeing an example and its                                                         point we only receive the actual label of positive examples.
                         true label, is small. Under these conditions prediction sys-                                                        We refer to this as semi-deferred labelling, as we have some
                         tems can react to concept drift as it occurs. This assumption                                                       labelled examples, but only from one class. This still does
                         may not be valid in all real-world scenarios, for example                                                           not allow us to update or retrain the model, neither does
                         because it is computationally or labour intensive to do so [2].                                                     it allow us to set a new threshold that optimizes our target
                                                                                                                                             metric. Setting this new threshold is important, since we
                         Published in the Proceedings of the Workshops of the EDBT/ICDT 2024                                                 still want to make a prediction for all unlabelled examples.
                         Joint Conference (March 25-28, 2024), Paestum, Italy                                                                We therefore introduce OPUS, Online Prediction threshold
                         *
                           Corresponding author.                                                                                             optimization Under Semi-deferred labelling, which aims to
                         $ y.spenrath@tue.nl (Y. Spenrath); m.hassani@tue.nl (M. Hassani);                                                   set a better threshold for an existing prediction model, based
                         b.f.v.dongen@tue.nl (B. F. v. Dongen)
                                                                                                                                             on the limited available positive examples. Note that this
                          0000-0003-0908-9163 (Y. Spenrath); 0000-0002-4027-4351
                         (M. Hassani); 0000-0002-3978-6464 (B. F. v. Dongen)                                                                 is not the same as imbalanced training, as we have only a
                                     © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribu-
                                     tion 4.0 International (CC BY 4.0).
                                                                                                                                             single class of up-to-date data and do not make assumptions

CEUR
                  ceur-ws.org
Workshop      ISSN 1613-0073
Proceedings
about the possibly outdated data.                                The first is retraining the classifier with an extended training
   Even though OPUS can be generalized to other scenarios,       set, where the new positive examples are added. This can
we discuss a supermarket one in this paper. Shoppers in this     either have limited effect if the number of positive labels is
supermarket make use of loyalty cards which identify them        small or it can heavily bias the prediction model towards the
at each purchase they make. Once a year, the supermarket         positive class if the number of positive labels is high. The
holds a so-called loyalty campaign [4]. During the campaign,     second category builds a model based on the positive and
shoppers may collect stamps. These stamps can be collected       unlabelled state of the examples. The third is to ignore the
by spend (for example one stamp for every 10 monetary            positive labels, and assume all new data is unlabelled. All
units) or in special promotions (an extra stamp for a certain    these techniques make a new prediction model or update
product). If a shopper has collected enough stamps they can      the existing one. We discuss the second and third categories
purchase a reward. Rewards are usually luxurious products        below, including their specific disadvantages in the semi-
available at a much lower (sometimes only symbolic) price.       deferred labelling problem.
Shoppers are as such persuaded to participate in the loyalty
campaign. In this work we consider campaigns with a lim-         2.1. PU-learning
ited time scope: shoppers can only collect the stamps and
redeem rewards within the duration of the campaign. For          The original PU-learning solution was proposed in [5]. Here
such a campaign to work, the supermarket must accurately         a classifier is build using positive and unlabelled examples,
predict how many rewards are required, such that each par-       where the class of the unlabelled examples is not known.
ticipating shopper can actually get their rewards. Having        Semi-deferred labelling is less restrictive in two ways. The
too few rewards means disappointing shoppers, having too         first is that we do have a, potentially outdated, set of la-
many means there will be unusable stock left. It is beneficial   belled data. For the supermarkets, this is a previous loyalty
to make predictions about what shoppers will do during           campaign. The second is that we do not have a single set
the rest of the campaign, in an effort to either steer their     of positive examples but rather receive additional positive
decisions or adapt the available rewards later on.               examples later on. Because Semi-deferred labelling is less
   For our prediction model, we are interested in whether        restrictive, PU-learning can still be applied to our problem
a shopper will make a redemption: shoppers have positive         at different points in time, though it might not be as effective
labels if they redeem a reward during the campaign, or           because it ignores part of the data.
negative labels otherwise. Consider two campaigns, 𝑝1 and           A solution to PU-learning was originally proposed in [5].
𝑝2 . During 𝑝2 we want to predict whether a shopper will         Given a set of examples, of which only a fraction is labelled,
redeem, based on a classifier we learned from the finished       and only from a single class (positive), the authors reason
campaign 𝑝1 . We continuously receive new data and make          that the conditional probability that an example belongs
new predictions and the end of every week. In practice this      to the positive class is equal to the conditional probability
means we get new positive examples in batches, these are all     that the example is labelled, divided by the probability a
consumers that first redeemed in the preceding week. Let 𝑡       positive sample is labelled. From this reasoning, one can
be the moment of prediction, since the start of 𝑝2 . We train    construct a classifier on whether a sample is labelled and
a classifier and select the best threshold. To do so, we split   use that to predict the probability that positive samples are
the data of campaign 𝑝1 into a train and validation set, and     labelled (as average of the classifier outcome for the labelled
train several models on the former and adjust the threshold      examples). For the prediction of the unlabelled examples,
for each model to maximize a target metric using the latter.     the probability given by the classifier is divided by this latter
One of these models has the highest metric value and we          value. An extension of this is discussed in [6], where some
select that as our prediction model. We retrain that model       of its negative effects of misclassification are discussed. PU-
on all shoppers of campaign 𝑝1 , and test it on campaign         learning usually assumes that the examples are labelled
𝑝2 , using the corresponding learned optimal threshold. The      completely at random. In [7] this assumption is abandoned,
problem with this approach is that the model trained on          where the probability of an example being labelled depends
campaign 𝑝1 might not work as well on campaign 𝑝2 . We
also do not get new negative labels during 𝑝2 , which means
that updating our prediction model or creating a new one is
not feasible. The best we can do is to adapt the prediction
threshold based on the limited labels we have available for
𝑝2 . This is the main contribution of this paper, OPUS is a
framework that adapts this threshold, based on the new
labelled data.
   The remainder of this paper is structured as follows. In
Section 2 we discuss existing solutions to the semi-deferred
prediction problem. In Section 3 we formalize the prediction
problem for loyalty campaign participation. In Section 4
we then discuss OPUS. We finally evaluate OPUS against
alternative methods in Section 5 and conclude the paper in       Figure 2: Various Semi-deferred labelling solutions and what
Section 6.                                                       data they use. 1) [Retrain] the model with the positive examples
                                                                 added to the existing dataset 2) [PU-learning] builds a model
                                                                 based on the positive and unlabelled state of the examples 3)
2. Related Work                                                  [Ignore] the positive labels, only using the examples themselves.
                                                                 All three categories ignore part of the data, while OPUS uses all
For this problem there are roughly three existing categories     available data.
of solutions. These are schematically presented in Figure 2.
on its attributes. This is closer to our specific scenario, as    learners can query selected labels from domain experts. An
consumers that are labelled as positive earlier in the loyalty    example of dealing with such conditions in a multiclass
campaign may have different characteristics than consumers        setting is COCEL [2]. An ensemble of one-class classifiers is
labelled later in the loyalty campaign. In [8], the assumption    kept, each predicting only whether an example belongs to
that all labels are accurate is dropped: the authors propose a    its respective class. If none of the classifiers recognizes the
robust method of an ensemble of Support Vector Machines           new example as one of their class, the example is added to a
(SVMs) that can deal with the impurity of positive examples.      buffer. Clusters in this buffer are labelled by domain experts.
For loyalty campaign participation prediction, this is not        Similar to COMPOSE, such solutions are not applicable in
applicable, we have an exact definition of positive labels. In    our scenario, as we cannot know the actual label until the
addition to these works, a more general overview of PU-           end of the loyalty campaign.
learning is discussed in [9].
   What all these techniques have in common is that existing
predictions models are not used or changed. While this does       3. Prediction Task
not invalidate their use in semi-deferred labelling, their more
                                                                  Before we explain how OPUS works, we first discuss the
restrictive assumption does not allow them to benefit from
                                                                  prediction task itself. All symbols used are summarized in
the existing data.
                                                                  Table 1. As introduced in Section 1, we want to solve the
                                                                  problem: “Given the data related to a consumer at time 𝑡
2.2. Ignoring new labels                                          during the loyalty campaign can we predict whether they
A more restrictive assumption on the labels is that no labels     will participate in the loyalty campaign?”. This is a problem
are available, short of an initial training set. This would       of binary classification: for each consumer we want to deter-
be comparable to ignoring newly converted consumers in            mine whether they are a member of one of two classes: the
the second loyalty campaign. Research dealing with such           positive class (participant) or negative class (no participant).
scenarios is often done in a streaming setting, either by pre-    One way to do this is to use a classifier with a prediction
dicting labels based on clustering techniques, or using semi-     threshold. A classifier is a function that assigns a probability
supervised learning methods. In [3], the authors propose a        to the encoding of a consumer. If the probability exceeds the
framework called SCARG. New examples are first classified         prediction threshold, the consumer is predicted to belong to
by an existing classifier. After enough new examples are          the positive class, otherwise they are predicted to belong to
seen, new clusters are formed. The clusters are matched           the negative class. While classifiers can usually be trained to
to existing clusters by their medoids and as such given a         make a prediction between more than two classes, we only
(new) label. Points in the clusters are then used to create a     consider binary classifiers in this paper which, without loss
new prediction model. A similar approach using fuzzy clus-        of generalization, predict the probability that a consumer is
tering [10] is made in [11]. Instead of making predictions        part of a positive class.
using an existing classifier, the COMPOSE framework dis-             Let 𝑀 : 𝒮 → [0, 1] be a function with 𝒮 the set of
cussed in [12] uses a semi-supervised approach. Assuming          all consumers. 𝑀 is a binary classifier that estimates the
an initial set of labels, new unlabelled examples are classi-     probability that a consumer 𝑠 belongs to the positive class.
fied in a semi-supervised manner. Next, only a selection          For a given prediction threshold 𝑡ℎ ∈ [0, 1], consumer 𝑠
of these examples are kept for the next timestep, at which        is predicted to belong to the positive class if and only if
new unlabelled examples are again classified using the semi-      𝑀 (𝑠) > 𝑡ℎ. The binary classifier and the prediction thresh-
supervised learner. The main contribution of COMPOSE              old together assign a predicted binary class, this is the binary
is the way in which the kept examples are selected: using         classification of the consumer. In this paper, the focus is
𝛼-shapes which are compacted until a desired number of            on determining the threshold. However, before we do so,
examples remain.                                                  we explain what data is used to train the classifier for the
   The commonality in these approaches is that they focus         prediction task. This is presented schematically in Figure 3.
on a streaming setting. At specific points in time, one or
more existing models are updated using estimated labels
                                                                  Table 1
from the respective solution. Although the authors show the
                                                                  Symbols used for the prediction task
effectiveness in dealing with streaming data by seeing many
new examples at different timesteps, there are three key dif-      Symbol        Meaning
ferences that make these methods unsuitable for our setting.       𝑡             Time since start of loyalty campaign for which
First, we do not have a constant stream of new examples,                         we train a model/make a prediction
instead we see new information about existing examples at          𝑝𝑖            Loyalty Campaign, where we make predictions
discrete points in time. Second, we are only interested in                       during 𝑝2 using a model trained on 𝑝1
improving a single model once, as we evaluate the existing         𝒮𝑖            Set of consumers that visited during 𝑝𝑖
model several weeks into the loyalty campaign. Third, we           𝒫𝒮 𝑖          Subset of 𝒮𝑖 that redeemed during 𝑝𝑖
do not expect the drift to be gradual; we are starting a whole     𝒩 𝒮𝑖          Subset of 𝒮𝑖 that did not redeem during 𝑝𝑖
new loyalty campaign a year later, so our expected type of         𝑀𝑖𝑡           Binary classification model trained on 𝑝𝑖
drift is more of the abrupt kind. If we were to apply SCARG        𝑡ℎ            Threshold to assign a label based on the proba-
                                                                                 bility predicted by a binary classification model
or fuzzy clustering to predict loyalty campaign participa-
                                                                   𝒞𝒮 𝑡𝑖         Subset of 𝒮𝑖 that redeemed by time 𝑡 in 𝑝1
tion it would effectively be a more elaborate 𝑘NN classifier.
                                                                   𝒰 𝒮 𝑡𝑖        Subset of 𝒮𝑖 that not (yet) redeemed by time 𝑡
COMPOSE is not applicable at all, since there is no way of         𝑜𝑡ℎ𝑡𝑖,𝑗       Optimal threshold based on probabilities com-
knowing the true label of non-participating consumers until                      puted by 𝑀𝑖𝑡 for shoppers in 𝑝𝑗
the end of the loyalty campaign.                                   𝑚𝑒𝑡𝑟𝑖𝑐𝑡𝑖,𝑗    Metric value computed using 𝑜𝑡ℎ𝑡𝑖,𝑗 and proba-
   Another solution to unavailability to unlabelled data is                      bilities computed by 𝑀𝑖𝑡 for shoppers in 𝑝𝑗
active learning, such as in [13]. Instead of having no labels,
                                                                  Table 2
                                                                  Symbols used by OPUS

                                                                   Symbol       Meaning
                                                                   𝑎𝑐𝑝𝑡𝑖,𝑗      The average probability of 𝒞𝒮 𝑡𝑗 predicted by 𝑀𝑖𝑡
                                                                   𝑎𝑛𝑐𝑝𝑡𝑖,𝑗     The average probability of 𝒰 𝒮 𝑡𝑗 predicted by 𝑀𝑖𝑡
                                                                   𝑇 𝑃 𝑀𝑖,𝑗     Threshold prediction model trained on loyalty
                                                                                campaigns 𝑝𝑖 and 𝑝𝑗


                                                                  encoding type, we train a final model, which now includes
                                                                  all consumers from 𝑝1 , and keep the previously computed
                                                                  threshold of that combination. With this new model and the
                                                                  optimal threshold, we can make predictions in the following
Figure 3: Overview of learning the model and optimal threshold    loyalty campaign, 𝑝2 .
in the first loyalty campaign. Since the first loyalty campaign
has ended we know the labels and we can split the consumers in
a positive set 𝒫𝒮 and a negative set 𝒩 𝒮 .                        3.2. Inference phase: the second loyalty
                                                                       campaign
                                                                  With the trained model, we can start making predictions
3.1. Training phase: the first loyalty                            during the second loyalty campaign. At point 𝑡 in 𝑝2 we
     campaign                                                     only have data available up to point 𝑡, so that is what we
                                                                  use in the encoding of consumers. We can identify two sets
We make a prediction at time 𝑡 relative to the start of the       of consumers. The first are those who have redeemed and
loyalty campaign, which ends at 𝑡𝑒 . We start by learning         have a positive label: 𝒞𝒮 𝑡2 . The second are those who have
a classifier from a previous loyalty campaign. For 𝑝1 we          not (yet) redeemed and may or may not still do so: 𝒰 𝒮 𝑡2 .
know which class each consumer that visited the store dur-        Both are indexed by 𝑡, as explained in Section 3 the sets
ing 𝑝1 belongs to, this is whether they redeemed a reward         change over time. Note that 𝒞𝒮 𝑡2 ⊆ 𝒫𝒮 2 , 𝒩 𝒮 2 ⊆ 𝒰𝒮 𝑡2
(𝒫𝒮 1 ) or not (𝒩 𝒮 1 ). The 1 in the subscript indicates that    and 𝒞𝒮 𝑡2 ∪ 𝒰𝒮 𝑡2 = 𝒮2 . Since we cannot distinguish between
these consumer sets belong to loyalty campaign 𝑝1 . We se-        the latter two, we make predictions for them using 𝑀1𝑡 , and
lect a stratified sample of 75% from this set to train a model    we assign the predicted labels using the learned 𝑜𝑡ℎ𝑡1,1 . With
and later use the remaining 25% to optimize the threshold.        the predicted and actual labels (which are available when 𝑝2
Many different types of classifiers exist, but in general the     ends), we can then compute the classification metric. This is
training of a classifier involves finding a function that maxi-   visually presented in Figure 4 in the white area. As discussed
mizes the predicted probability for consumers that belong         previously, we are not able to update the prediction model
to the positive class and minimizing it for consumers that        𝑀1𝑡 itself. However, we can change the threshold we use to
belong to the negative class.                                     assign the predicted class from the predicted probabilities
   The classifier that is trained on data up to time 𝑡 is re-     which is the topic of Section 4.
ferred to as 𝑀1𝑡 . The reason is that we want to use it at time
𝑡 in 𝑝2 , so only data up to that time in 𝑝1 should be used
to train the classifier, to make the encoding of consumers        4. OPUS framework
as similar as possible. This is indicated by the superscript 𝑡.
After training a classifier, we select a prediction threshold.    In this section, we sketch the idea behind OPUS and then
If we set the threshold lower, more consumers will be pre-        formalize it. Table 2 summarizes the new symbols. An
dicted as participants, if we set the threshold higher, fewer     overview of how OPUS adds to the existing prediction prob-
consumers will be predicted as participants. For a given          lem is shown in Figure 4.
binary classification metric and a set of consumers, there is         We start from 𝒞𝒮 𝑡2 and 𝒰𝒮 𝑡2 . As we do not know the
a threshold that can be considered the best, as it results in     labels of the latter, we cannot use these consumers to update
the best metric score. We call this the optimal threshold or      𝑀1𝑡 . We can however compute two characteristics about
𝑜𝑡ℎ𝑡1,1 . The consumers used to optimize the threshold come       the current loyalty campaign 𝑝2 : the average probability
from the 25% that was not used in training. The double 1          computed by 𝑀1𝑡 to each of them. We refer to these as
subscript indicates that it is the optimal threshold for the      the average converted probability 𝑎𝑐𝑝𝑡1,2 and the average
model trained on 𝑝1 and optimized using data from 𝑝1 . The        nonconverted probability 𝑎𝑛𝑐𝑝𝑡1,2 . The subscript 1 indicates
value of the metric for this optimal threshold is denoted by      that 𝑀1𝑡 is used for the model, the subscript 2 that 𝒞𝒮 𝑡2 and
𝑚𝑒𝑡𝑟𝑖𝑐𝑡1,1 .                                                      𝒰𝒮 𝑡2 are used for the consumers. We can do the same for
   Instead of training just one model and just one way of en-     𝑝1 , using 𝒞𝒮 𝑡1 and 𝒰𝒮 𝑡1 with 𝑀1𝑡 to compute 𝑎𝑐𝑝𝑡1,1 and
coding, we can also train multiple different model types and      𝑎𝑛𝑐𝑝𝑡1,1 respectively. In general, for loyalty campaigns
different ways of encoding. We repeat the model training          𝑝𝑖 , 𝑝𝑗 and time 𝑡 we have 𝑎𝑐𝑝𝑡𝑖,𝑗 = 𝜇𝑒∈𝒞𝒮 𝑡𝑗 [𝑀𝑖𝑡 (𝑒)] and
and threshold optimization for each combination of model          𝑎𝑛𝑐𝑝𝑡𝑖,𝑗 = 𝜇𝑒∈𝒰 𝒮 𝑡𝑗 [𝑀𝑖𝑡 (𝑒)]. Note that these values do not
type and encoding type. Each combination then results in a        necessarily say something about the model quality. On the
value for the metric and we select the combination with the       one hand, we expect 𝑎𝑐𝑝𝑡1,2 to be high, since we know that
highest value. This process is referred to as hyper-parameter     it evaluates only positive consumers. On the other hand,
optimization. We discuss the exact model types in Section 5,      we cannot make an expectation for 𝑎𝑛𝑐𝑝𝑡1,2 , as it evaluates
the different encoding types are beyond the scope of this pa-     negative and possibly also positive consumers. One thing
per. After selecting the best combination of model type and
                                                                     4.1. Baseline thresholds
                                                                     For the baselines we consider values that at most use the
                                                                     training data, so we only require data from the first loyalty
                                                                     campaign. Without any alteration, a baseline of 0.5 is often
                                                                     used as standard. We further add the optimal threshold
                                                                     based on the training data, 𝑜𝑡ℎ𝑡1,1 .

                                                                     4.2. Heuristic thresholds
                                                                     For the heuristic thresholds we consider values that also
                                                                     make use of the second loyalty campaign, and which can
                                                                     be used without further knowledge. For this, we consider
                                                                     four linear formulas. The first, ∅𝑎𝑐𝑝, multiplies the optimal
                                                                     thresholds from training, 𝑜𝑡ℎ𝑡1,1 with the ratio between the
                                                                     average converted probabilities, 𝑎𝑐𝑝𝑡1,2 /𝑎𝑐𝑝𝑡1,1 . The idea
                                                                     behind this method is that if the average probability of the
Figure 4: The original prediction task (white area with solid        converted consumers has increased (or decreased), then the
bounds) and how OPUS extends it (gray area with dashed bounds).      model likely overestimates (underestimates) the probability
The set of converted consumers 𝒞𝒮 is now used to tell something      of a consumer being positive. Similarly, we can also use
about the model.                                                     the average non-converter probabilities for this through
                                                                     multiplying 𝑜𝑡ℎ𝑡1,1 by 𝑎𝑛𝑐𝑝𝑡1,2 /𝑎𝑛𝑐𝑝𝑡1,2 . We refer to this as
                                                                     the ∅𝑎𝑛𝑐𝑝 method. Apart from the ratio, we can also take
Table 3                                                              the difference, adding 𝑎𝑐𝑝𝑡1,2 − 𝑎𝑐𝑝𝑡1,1 to 𝑜𝑡ℎ𝑡1,1 , which we
All threshold methods evaluated in this paper
                                                                     refer to as ∆𝑎𝑐𝑝. The ∆𝑎𝑛𝑐𝑝 method is defined in a similar
 Type        Name      Description                                   way. Note that in principle any of these four may result in
                                                                     a value above 1, and the difference based heuristic method
             Regular   No altered threshold (0.5)
 Base


             𝑜𝑡ℎ𝑡𝑖,𝑖   Optimal threshold from training phase         may result in a value below 0. This does not invalidate the
                                                                     threshold, but it means that all entities will be predicted to
             ∅𝑎𝑐𝑝      𝑜𝑡ℎ𝑡𝑖,𝑖 multiplied by the ratio of 𝑎𝑐𝑝        be negative or positive, respectively.
 Heuristic


             ∅𝑎𝑛𝑐𝑝     𝑜𝑡ℎ𝑡𝑖,𝑖 multiplied by the ratio of 𝑎𝑛𝑐𝑝
             Δ𝑎𝑐𝑝      𝑜𝑡ℎ𝑡𝑖,𝑖 increased by the difference of 𝑎𝑐𝑝
             Δ𝑎𝑛𝑐𝑝     𝑜𝑡ℎ𝑡𝑖,𝑖 increased by the difference of 𝑎𝑛𝑐𝑝
                                                                     4.3. Learned thresholds
             DT        𝑇 𝑃 𝑀𝑖,𝑗 , with Decision Tree Regressor       For the learned thresholds we need full information about
 TPM


             RF        𝑇 𝑃 𝑀𝑖,𝑗 , with Random Forrest Regressor      the second loyalty campaign, meaning we can only apply
                                                                     the learned thresholds to a third loyalty campaign. What the
                                                                     heuristic methods have in common is that they use one or
to note is that for 𝑎𝑐𝑝𝑡1,2 we only evaluate consumers that          more of the 𝑎𝑐𝑝, 𝑎𝑛𝑐𝑝 and 𝑜𝑡ℎ values to make an estimation
have already converted by time 𝑡 in 𝑝2 . It might be that for        about the target value (𝑜𝑡ℎ𝑡1,2 ). For the final set of threshold
𝑡 early in the loyalty campaign we are including only a very         estimation methods, we train a regression model, which
specific subset of eager consumers. Such consumers may               we refer to as the threshold prediction model, or 𝑇 𝑃 𝑀 . If
show different behaviour from consumers that converted               needed, we refer to 𝑀1𝑡 as the consumer prediction model
later in the loyalty campaign. As such, 𝑎𝑐𝑝𝑡1,2 may not be an        to distinguish between the two. 𝑇 𝑃 𝑀 takes as descriptive
accurate representation of the average predicted probability         space all known values at 𝑡: the (non-)converted probabili-
of all positive consumers, 𝑎𝑐𝑝𝑡1,2  𝑒
                                      . This is not a problem        ties 𝑎𝑐𝑝𝑡1,1 , 𝑎𝑐𝑝𝑡1,2 , 𝑎𝑛𝑐𝑝𝑡1,1 , 𝑎𝑛𝑐𝑝𝑡1,2 , the optimal training
per se. First, the model is still trained on all consumers, it       threshold 𝑜𝑡ℎ𝑡1,1 , the training metric value 𝑚𝑒𝑡𝑟𝑖𝑐𝑡1,1 , the
therefore also captures behaviour from the consumers that            model specification (type and encoding) and the point in
convert later. Second, we can compare 𝑎𝑐𝑝𝑡1,2 to 𝑎𝑐𝑝𝑡1,1 . As        time 𝑡. This is represented by the red lines Figure 4. As
such, both will be biased towards early redeemers if 𝑡 is            target space, we have the value of 𝑜𝑡ℎ𝑡1,2 . We can sample
small. Therefore, we expect to compare the same type of              these values for different values of 𝑡 and for each of the
consumers. We can therefore use these values to compare              different model specifications. Using two loyalty campaigns,
𝑝1 and 𝑝2 and adapt the prediction threshold.                        we create a training set for 𝑇 𝑃 𝑀 , with the target values
   As we cannot change 𝑀1𝑡 , the best we can do is changing          coming from the second loyalty campaign and the descrip-
the prediction threshold. The best value is the one which            tive features from both loyalty campaigns. We denote this
is computed based on the labelled sets of consumers of the           𝑇 𝑃 𝑀 as 𝑇 𝑃 𝑀1,2 , without the superscript 𝑡 as multiple
second loyalty campaign; 𝒫𝒮 2 and 𝒩 𝒮 2 . In line with the           values of 𝑡 are used to train 𝑇 𝑃 𝑀1,2 . Note that we cannot
previous indexing, this value is 𝑜𝑡ℎ𝑡1,2 . We can only know          use 𝑇 𝑃 𝑀1,2 to estimate a threshold to use during 𝑝2 , as 𝑝2
this value once 𝑝2 ends, so during the loyalty campaign we           needs to be finished before 𝑇 𝑃 𝑀1,2 can be learned. OPUS
need different ways of estimating it. Apart from the Baseline        does not depend on the specific regressor used as a base
methods, we differentiate between Heuristic thresholds and           model for 𝑇 𝑃 𝑀 , in Section 5 we evaluate Decision Tree
Learned thresholds. All methods are presented in Table 3.            and Random Forest regressors.
                                                                        The idea behind the learned thresholds is that 𝑇 𝑃 𝑀1,2
                                                                     learns how to adapt the prediction method (model and
                                                                     threshold) from one loyalty campaign to another. This
                                                                     means we could learn how the prediction method is best
‘transformed’ from 𝑝1 to 𝑝2 , and then apply this learned           the store, using aggregates based on [18] and descriptive
transformation. Suppose that 𝑝1 rewards glassware, 𝑝2 re-           labels based on [19] We finally adopt two strategies: with
wards pans, and a third 𝑝3 rewards kitchen knives. If we            and without adding the converted consumers in the test
learn a transformation from glassware to pans, then we              set to the training set. In either strategy, the prediction
can assess 1) how well does it work on transforming the             quality metrics are only computed on the non-converted
prediction method for a different set of consumers from             consumers in the test set. For each timestep we select the
glassware to pans and 2) how well does that transformation          hyper-parameter combination with the best performance
apply to the transformation from pans to kitchen knives. In         during the training phase using 𝑜𝑡ℎ𝑡1,1 to report the results
the evaluation in Section 5 we limit the evaluation to the          for all threshold estimation methods.
first due to the unavailability of additional data on a third          For each combination of hyper-parameters we compute
loyalty campaign.                                                   𝑎𝑐𝑝𝑡1,𝑖 , 𝑎𝑛𝑐𝑝𝑡1,𝑖 , and 𝑜𝑡ℎ𝑡1,𝑖 for 𝑖 ∈ {1, 2} and each timestep
                                                                    𝑡. On the one hand we can use this to compute all threshold
4.4. Existing Threshold Adaption techniques                         estimation methods discussed in Table 3. On the other hand
                                                                    we can use all combinations to train threshold prediction
OPUS is not the first framework that optimizes thresholds.          models. For each dataset 𝑐𝑥 we compute two 𝑇 𝑃 𝑀 s, one
In GHOST [14], the optimal threshold of any classifier is           using a Decision Tree regressor, 𝐷𝑇𝑥 , and one using a Ran-
optimized on a validation set. The metric value for several         dom Forest regressor 𝑅𝐹𝑥 . These 𝑇 𝑃 𝑀 𝑠 are then used in
thresholds is computed and averaged over several stratified         every other dataset 𝑐𝑦 , 𝑦 ̸= 𝑥, to estimate optimal thresh-
subsets of a validation set. While this technique may prove         olds. We also add two results for comparison. The first uses
effective for finding an optimal threshold for the original         the actual optimal threshold (𝑜𝑡ℎ𝑡1,2 ) and the second uses
training and validation data, there is no update in a later         a prediction model learned on 75% of the data of loyalty
stage, and such an update would require labelled data. As           campaign 𝑝2 and tested on the remaining 25% (Retrain).
such, GHOST is not suitable for the problem at hand. Simi-          Both these methods violate the train-then-test principle but
larly to GHOST, [15] proposes a method to find the optimal          are added as a comparison to how close we can get.
threshold in a training set. The authors prove that for such           During the training phase, models are trained on a strati-
a method only the probabilities of positive class are required      fied 75% split of the consumers that visited the store during
when optimizing for 𝐹1 . While this technique potentially           loyalty campaign 𝑝1 . The remaining 25% are used as test
saves computation time, it is not universally applicable to         set. During the inference phase, models are trained on all
all metrics. In this paper we optimize the thresholds by            consumers that visited the store during loyalty campaign
evaluating all predicted probabilities, and selecting the one       𝑝2 . The consumers that visited the store by time 𝑡 are used
that optimizes the target metric. Finally, [16] proposes a          as test set. The entire test set is used to compute 𝑜𝑡ℎ𝑡1,𝑖
technique to adapt the separating hyperplane position of            using the model trained on the training set. The converted
SVMs to improve its predictive quality. OPUS is different           (non-converted) consumers by time 𝑡 in the test set are used
from these because it tries to optimize the threshold using         to compute 𝑎𝑐𝑝𝑡1,𝑖 (𝑎𝑛𝑐𝑝𝑡1,𝑖 ). The non-converted consumers
the new, partly labelled data in an online fashion. OPUS is         by time 𝑡 are used to compute the model performance for a
designed to improve the predictions under concept drift.            given threshold estimation method.
                                                                       Next to these methods, we also apply PU-learning defined
                                                                    by [5]. For this we use the same hyper-parameters as for
5. Experimental Evaluation                                          the other experiments. We first find the best combination
For the experimental evaluation1 we use data from a real-           based on loyalty campaign 𝑝1 and then report the score for
world retailer containing a total of about 160 000 consumers        that hyper-parameter combination in 𝑝2 , computed over all
shopping in exactly one of two consecutive loyalty cam-             non-converted consumers at the respective point in time.
paigns, 𝑝1 and 𝑝2 . We partition the consumers in ten sep-             We apply the above for both the 𝐹1 and Accuracy metrics.
arate datasets 𝑐1 through 𝑐10 . For each dataset we build           We present two types of results: those for timestep 𝑡 = 4,
several prediction models based on a grid-search over hyper-        separated per dataset, and those for all timesteps, aggregated
parameters and at different timesteps relative to the start of      over the datasets. The reason for separately reporting 𝑡 = 4
the loyalty campaign.                                               comes from domain knowledge, this is the point in the
   Models are trained and tested at multiple points in time         loyalty campaign where additional rewards for the loyalty
relative to the loyalty campaign, after 2, 4, 6, . . . , 18 weeks   campaign can still be ordered by the supermarket.
into the campaign, ending before 𝑡𝑒 . The encoding of con-
sumers is based on the data available from the pre loyalty          5.1. Results for 𝑡 = 4
campaign period and the data during the loyalty campaign
                                                                    Results are presented per split (one per row) and threshold
up to 𝑡. We consider three hyper-parameters: the base
                                                                    estimation method (one per column) in Figure 5. We do
model, the encoding technique, and a strategy. The base
                                                                    not report the values of the combinations where a 𝑇 𝑃 𝑀 is
models are the consumer prediction models, for which we
                                                                    learned from the same split as the data it is tested on (the
use 𝑘-nearest neighbours (𝑘NN), decision tree classifiers
                                                                    white diagonals).
(DTs), support vector classifiers (SVCs) and Adaptive Hoeffd-
                                                                       Baseline methods For the baseline methods we see that
ing Option Trees (AHOTs), all with their default parameters
                                                                    using the regular value is not much better than always pre-
in Scikit-Learn for 𝑘NN, DTs, SVCs, and Scikit-Multiflow
                                                                    dicting True or False, almost all scores being 0.50, which is
[17] for AHOTs, except for setting a constant seeding and
                                                                    effectively the worst value one can get for with balanced ac-
training the SVCs on probabilities instead of labels. The en-
                                                                    curacy in a binary classification. Using the optimal training
coding techniques define what type of features we use about
                                                                    threshold 𝑜𝑡ℎ𝑡𝑖,𝑖 works much better, with the exception of
a consumer. The features are based on individual visits to
                                                                    𝑐5 , 𝑐6 and 𝑐7 .
1
    For the implementation, see github.com/YorickSpenrath/opus
       BL       Heuristic                                          TPM
c0   .50 .82 .81 .82 .50 .82 .82 .83 .83 .83 .86 .83 .83 .82 .82 .82 .83 .86 .83 .83 .83 .83 .82 .71 .86 .86 .68
c1   .50 .80 .80 .80 .66 .80 .68     .80 .83 .83 .82 .82 .80 .80 .80 .82 .80 .83 .82 .80 .82 .80 .80 .80 .83 .85 .50
c2   .50 .82 .82 .82 .50 .82 .82 .82 .85 .82 .85 .85 .82 .82 .82 .82 .82         .82 .85 .82 .82 .82 .82 .82 .85 .87
c3   .55 .77 .78 .78 .84 .77 .84 .84 .84 .77 .77 .84 .78 .81 .84 .84 .84 .84 .81 .81 .81 .78 .78 .77 .84 .87 .82
c4   .50 .82 .82 .82 .72 .82 .52 .84 .82 .84 .84 .52 .82 .84 .82 .84 .84 .82 .84 .84 .84 .82 .84 .82 .85 .86 .51
c5   .50 .68 .81 .81 .81 .68 .81 .83 .85 .83 .81 .83 .68 .83 .83 .85 .85 .85 .85 .85 .85 .81 .83 .81 .85 .82
c6   .50 .68 .68 .81 .68 .68 .86 .86 .86 .81 .86 .68 .81 .68 .86 .81 .81 .86 .81 .86 .81 .81 .68 .81 .87 .88 .65
c7   .50 .66 .66 .66 .66 .66 .66 .84 .81 .84 .84 .84 .84 .84 .66 .84 .84 .84 .84 .84 .84 .84 .84 .81 .84 .87
c8   .50 .86 .86 .86 .50 .50 .86 .86 .86 .86 .77 .77 .86 .77     .77 .86 .86 .86 .86 .86 .86 .86 .86 .77 .86 .86
c9   .50 .80 .80 .80 .80 .80 .80 .80 .80 .80 .80 .80 .80 .80 .80     .80 .80 .80 .80 .80 .80 .80 .80 .80     .80 .85 .78
     Regular


     Retrain
        acp
       ancp

       ancp
       othi,t i


       othi,t j
        RF0
        RF1
        RF2
        RF3
        RF4
        RF5
        RF6
        RF7
        RF8
        RF9
        acp


         PU
        DT0
        DT1
        DT2
        DT3
        DT4
        DT5
        DT6
        DT7
        DT8
        DT9
          Figure 5: Accuracy at timestep 4. Values shown in bold are strictly better than the 𝑜𝑡ℎ𝑡𝑖,𝑖 , the ones in italic are strictly smaller.
          Higher values correspond to a brighter colour. The white diagonals are skipped since those the 𝑇 𝑃 𝑀 would be trained on the
          same split we test it on. Some scores for PU-learning are left out; for these splits there were not enough converted consumers
          to train a meaningful prediction model.


   Heuristic methods For the heuristic methods we have                    describes the benefit of using the learned method in OPUS
that, except for (𝑐8 , ∆𝑎𝑛𝑐𝑝), the 𝑎𝑛𝑐𝑝 methods consistently              in general. We see what one can on average expect using
performs as well or better than the baseline 𝑜𝑡ℎ𝑡𝑖,𝑖 . The 𝑎𝑐𝑝            any other dataset to train the 𝑇 𝑃 𝑀 to estimate the opti-
values do much worse however, especially for the ∆𝑎𝑐𝑝.                    mal threshold is beneficial. What is more, mainly for the
As explained in Section 4.2, if the difference between 𝑎𝑐𝑝𝑡𝑖,𝑖            Random Forest 𝑇 𝑃 𝑀 , we see that we get values that are
and 𝑎𝑐𝑝𝑡𝑖,𝑗 is too high, this method might end up predicting              close to using the actual optimal threshold 𝑜𝑡ℎ𝑡𝑖,𝑗 . For the
all shoppers as redeemers or all shoppers as non-redeemers.               heuristic thresholds, we see that the ratio-based ones (∅)
This is what happens for splits 0, 2 and 8.                               are performing close to or better than the 𝑜𝑡ℎ𝑡𝑖,𝑖 baseline.
   Learned methods With only a few exceptions, the                        For PU-learning we see that the results improve over time
learned methods always perform at least as good as the                    as more positive labels are known. This is to be expected,
baseline and often beat it. Furthermore, for most splits the              though it does not yet catch with OPUS methods.
best values even come close to the results for 𝑜𝑡ℎ𝑖,𝑗 , the                  F1 We see something similar to the accuracy results. This
value which we aim to estimate. Next to this, we see that                 is expected, as we are penalizing in a similar way by com-
the Random Forest 𝑇 𝑃 𝑀 performs almost always as well                    paring the actual with the predicted labels for each shopper.
as or better than the Decision Tree based 𝑇 𝑃 𝑀 , compared                There are two important differences. The first is that we
on the same combination of splits. This is to be expected                 see lower values. This is because 𝐹1 does not account for
given the complexity of the Random Forest model and that                  class imbalance as we did for the balanced accuracy score.
it can as such capture more difficult relations.                          Given that the redeemer class is much smaller than the non-
   PU-Learning One weakness of PU-learning is if the num-                 redeemer class, the reported values for 𝑜𝑡ℎ𝑡𝑖,𝑗 and Retrain
ber of known positive labels is small. This results in a base             are reasonable. This is also seen by the non 𝑜𝑡ℎ𝑡𝑖,𝑖 baselines,
prediction model that cannot predict positive points, result-             which have a much lower score. The second is that we are
ing in an average probability of 0 for the known positive                 further off from the optimal thresholds 𝑜𝑡ℎ𝑡𝑖,𝑗 . This is likely
labels. In such a scenario, PU-learning fails to produce any              because of a similar reason: the 𝐹1 is much more penalizing
result. This is what happens in 4 of the splits. Note that                for imbalanced datasets, and as such, being able to use the
OPUS does not have this problem, as we are using the la-                  optimal thresholds/all labels allows for a much better score
belled data from last loyalty campaign. In the splits where               than estimating it using any of the methods from OPUS.
there are enough known positive labels, it is still outper-               Like accuracy, the results of PU-learning improve over time,
formed by OPUS, for a similar reason.                                     eventually even improving over OPUS methods. The lower
                                                                          accuracy and higher 𝐹1 for PU-learning is likely caused by
5.2. Results aggregated per timestep                                      the unbalanced dataset as well.

We next aggregate the results over the splits for each
timestep in Figures 6a and 6b. We report the mean and                      6. Conclusion
95% confidence interval of the metric for every timestep
(each row). For most estimation methods (each column) this                 In this paper we have presented OPUS, a framework to es-
is the average over the 10 values from different splits, for the           timate a better prediction threshold when only one class
learned estimation methods we consider all combinations.                   of the target labels is available. The advantage of such
In other words, the reported aggregated values for 𝐷𝑇 and                  optimizations is that we can adapt existing prediction mod-
𝑅𝐹 are taken over all 90 combinations.                                     els without having to rely on fully labelled data. This is a
   Accuracy Some results from Section 5.1 can be trans-                    more realistic assumption when the true label of one class
ferred to more timesteps. The learned thresholds seem to                   is deferred. OPUS makes use of two types of threshold esti-
perform better than using 𝑜𝑡ℎ𝑡𝑖,𝑖 . Note that, as we have                  mation methods, where learned methods are more effective
averaged over both the splits used for the evaluation as                   in getting to an optimal prediction threshold at the cost of
well as the splits used for computing the 𝑇 𝑃 𝑀 , this result              requiring more data (only being available at a third cam-
           Baseline                   Heuristic                    TPM            Comparison
 4    .50 ±.01 .77 ±.04 .79 ±.04 .80 ±.03 .67 ±.08 .74 ±.06 .81 ±.01 .82 ±.01 .85 ±.01 .86 ±.01 .66 ±.10
 8    .53 ±.02 .80 ±.03 .80 ±.03 .80 ±.04 .65 ±.10 .78 ±.06 .81 ±.01 .82 ±.01 .84 ±.02 .88 ±.01 .73 ±.03
12    .53 ±.03 .78 ±.04 .76 ±.05 .76 ±.05 .68 ±.08 .79 ±.04 .78 ±.02 .80 ±.02 .83 ±.03 .86 ±.01 .74 ±.03
16    .60 ±.04 .79 ±.06 .77 ±.07 .77 ±.05 .72 ±.09 .76 ±.07 .80 ±.01 .81 ±.01 .83 ±.03 .89 ±.01 .73 ±.04
                                                                (a) Accuracy
 4    .03 ±.03 .14 ±.09 .17 ±.09 .20 ±.09 .21 ±.09 .12 ±.09 .25 ±.03 .29 ±.02 .34 ±.03 .40 ±.02 .15 ±.08
 8    .09 ±.05 .19 ±.08 .22 ±.09 .20 ±.08 .12 ±.06 .20 ±.08 .20 ±.03 .22 ±.03 .33 ±.04 .39 ±.03 .34 ±.03
12    .06 ±.05 .21 ±.06 .18 ±.07 .19 ±.07 .15 ±.06 .19 ±.07 .20 ±.02 .20 ±.02 .25 ±.05 .35 ±.02 .32 ±.01
16    .07 ±.04 .11 ±.05 .10 ±.05 .09 ±.05 .11 ±.05 .11 ±.05 .15 ±.01 .17 ±.01 .20 ±.02 .31 ±.03 .25 ±.01
        Regular


                                                                                     DT


                                                                                                 RF


                                                                                                                        Retrain
                                                          acp
                                             ancp


                                                                       ancp
                     othi,t i


                                                                                                            othi,t j
                                 acp


                                                                                                                                    PU
                                                                     (b) 𝐹1
           Figure 6: a) Accuracy and b) 𝐹1 at each timestep, averaged over the splits with their 95% confidence interval. Values shown
           in bold are strictly better than the 𝑜𝑡ℎ𝑡𝑖,𝑖 , the ones in italic are strictly worse (in terms of the reported mean). Better values
           correspond to a brighter colour.


paign), compared to heuristic methods that do not have                             robust ensemble approach to learn from positive and
these benefits or limitations. While the initial results are                       unlabeled data using SVM base models, Neurocom-
promising, we want to elaborate the experimental analysis,                         puting 160 (2015) 73–84.
both by including more than two loyalty campaigns and                          [9] J. Bekker, J. Davis, Learning from positive and unla-
by applying OPUS on a different use case inside the student                        beled data: a survey, Machine Learning 109 (2020)
learning domain. In the design of OPUS only characteristics                        719–760.
that belong to the current timestep are considered, not of                    [10] P. De Abreu Lopes, H. De Arruda Camargo, Fuz-
all previous timesteps. We argue that the progression of                           zStream: Fuzzy data stream clustering based on the
these characteristics, as well as the model performance over                       online-offline framework, IEEE International Confer-
time, and how they compare to the same period in the pre-                          ence on Fuzzy Systems (2017).
vious loyalty campaign can be interesting to improve the                      [11] T. P. da Silva, V. M. A. Souza, G. E. A. P. A. Batista,
framework in future work.                                                          H. de Arruda Camargo, A fuzzy classifier for data
                                                                                   streams with infinitely delayed labels, LNCS 11401
                                                                                   (2019) 287–295.
References                                                                    [12] K. B. Dyer, R. Capo, R. Polikar, Compose: A semisuper-
                                                                                   vised learning framework for initially labeled nonsta-
 [1] W. Rizzi, C. Di Francescomarino, C. Ghidini, F. M.
                                                                                   tionary streaming data, IEEE Transactions on Neural
     Maggi, How do I update my model? On the resilience
                                                                                   Networks and Learning Systems 25 (2014) 12–26.
     of Predictive Process Monitoring models to change,
                                                                              [13] B. Settles, Active Learning Literature Survey, Techni-
     KAIS 64 (2022) 1385–1416.
                                                                                   cal Report, University of Wisconsin-Madison, 2009.
 [2] C. Fahy, S. Yang, M. Gongora, Classification in Dy-
                                                                              [14] C. Esposito, G. A. Landrum, N. Schneider, N. Stiefl,
     namic Data Streams with a Scarcity of Labels, IEEE
                                                                                   S. Riniker, GHOST: Adjusting the Decision Threshold
     TKDE (2021).
                                                                                   to Handle Imbalanced Data in Machine Learning, Jour-
 [3] V. M. A. Souza, D. F. Silva, J. Gama, G. E. A. P. A. Batista,
                                                                                   nal of Chemical Information and Modeling 61 (2021)
     Data Stream Classification Guided by Clustering on
                                                                                   2623–2640.
     Nonstationary Environments and Extreme Verification
                                                                              [15] Q. Zou, S. Xie, Z. Lin, M. Wu, Y. Ju, Finding the Best
     Latency, in: SDM, 2015, pp. 873–881.
                                                                                   Classification Threshold in Imbalanced Classification,
 [4] N. J. Bombaij, S. Gelper, M. G. Dekimpe, Designing suc-
                                                                                   Big Data Research 5 (2016) 2–8.
     cessful temporary loyalty programs: An exploratory
                                                                              [16] H. Yu, C. Mu, C. Sun, W. Yang, X. Yang, X. Zuo, Support
     study on retailer and country differences, International
                                                                                   vector machine-based optimized decision threshold
     Journal of Research in Marketing 39 (2022) 1275–1295.
                                                                                   adjustment strategy for classifying imbalanced data,
 [5] C. Elkan, K. Noto, Learning classifiers from only posi-
                                                                                   Knowledge-Based Systems 76 (2015) 67–78.
     tive and unlabeled data, KDD (2008) 213–220.
                                                                              [17] J. Montiel, J. Read, A. Bifet, T. Abdessalem, Scikit-
 [6] J. Peeperkorn, C. O. Vázquez, A. Stevens, J. D. Smedt,
                                                                                   Multiflow: A Multi-output Streaming Framework,
     S. Broucke, J. D. Weerdt, Outcome-Oriented Predictive
                                                                                   JMLR 19 (2018) 1–5.
     Process Monitoring on Positive and Unlabelled Event
                                                                              [18] Y. Spenrath, M. Hassani, B. F. van Dongen, Online
     Logs, ICPM workshops (2022).
                                                                                   prediction of aggregated retailer consumer behaviour,
 [7] J. Bekker, P. Robberechts, J. Davis, Beyond the Selected
                                                                                   ICPM Workshops (2021).
     Completely at Random Assumption for Learning from
                                                                              [19] Y. Spenrath, M. Hassani, B. v. Dongen, H. Tariq, Learn-
     Positive and Unlabeled Data, LNCS 11907 LNAI (2020)
                                                                                   ing an efficient distance metric for retailer transaction
     71–85.
                                                                                   data, in: PKDD, 2020, pp. 323–338.
 [8] M. Claesen, F. De Smet, J. A. Suykens, B. De Moor, A