=Paper=
{{Paper
|id=Vol-3651/DARLI-AP_paper10
|storemode=property
|title=Online Prediction Threshold Optimization Under Semi-deferred Labelling
|pdfUrl=https://ceur-ws.org/Vol-3651/DARLI-AP-10.pdf
|volume=Vol-3651
|authors=Yorick Spenrath,Marwan Hassani,Boudewijn Frans Van Dongen
|dblpUrl=https://dblp.org/rec/conf/edbt/SpenrathHD24
}}
==Online Prediction Threshold Optimization Under Semi-deferred Labelling==
Yorick Spenrath*, Marwan Hassani and Boudewijn F. van Dongen
Process Analytics Group, Faculty of Mathematics and Computer Science, Eindhoven University of Technology, The Netherlands
Abstract
In supermarket loyalty campaigns, shoppers collect stamps to redeem limited-time luxury products. Having an accurate prediction of which shoppers will eventually redeem is crucial to effective execution. With the ultimate goal of changing shopper behavior, it is important to ensure an adequate number of rewards and to be able to steer promising shoppers into joining the campaign and redeeming a reward. If information from previous campaigns is available, a prediction model can be built to predict the redemption probability, possibly also adapting the prediction threshold that determines the predicted label. During a running campaign, we only know a subset of the labels of the positive class (the so-far redeemers), and have no access to the labels of any example of the negative class (non-redeemers at the end of the campaign). The majority of the examples during the campaign do not have a label yet (shoppers that could still redeem but have not done so yet). This is a semi-deferred labelling setting, and our goal is to improve the prediction quality using this limited information. Existing work on predicting (semi-deferred) labels either focuses on positive-unlabelled learning, which does not use existing models, or updates models after the prediction is made by assigning expected labels using unsupervised learning models. In this paper we present a framework for Online Prediction threshold optimization Under Semi-deferred labelling (OPUS). Our framework does not change the existing model, but instead adapts the prediction threshold that decides which probability is required for a positive label, based on the semi-deferred labels we already know. We apply OPUS to real-world data from a supermarket with two campaigns and over 160 000 shoppers.

Keywords: Prediction threshold, Online, Semi-deferred labels, Supermarket loyalty campaign

Published in the Proceedings of the Workshops of the EDBT/ICDT 2024 Joint Conference (March 25-28, 2024), Paestum, Italy.
* Corresponding author. Contact: y.spenrath@tue.nl (Y. Spenrath); m.hassani@tue.nl (M. Hassani); b.f.v.dongen@tue.nl (B. F. van Dongen). ORCID: 0000-0003-0908-9163 (Y. Spenrath); 0000-0002-4027-4351 (M. Hassani); 0000-0002-3978-6464 (B. F. van Dongen).
(c) 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
1. Introduction

Traditional supervised machine learning projects start with a set of labelled data. Specifically in binary classification, data belongs to one of two classes: positive and negative. Starting from a set of examples with their ground truth label, a predictor is created, which can assess the probability that an unseen example belongs to the positive class. Without optimization, an example is predicted to be part of the positive class if this probability is at least 0.5. One improvement is to select a different prediction threshold that distinguishes examples between the positive and negative class on their probability. In the presence of labelled data, such a threshold can be picked in a way that the predicted labels maximize a given metric. Provided that no concept drift occurs, the model with the adapted threshold can be used indefinitely without degrading prediction quality. In this scenario, we only use the "offline training" (I) of Figure 1.

Figure 1: I) In an offline phase, all labels are present. II) In the online phase, this data is potentially outdated, but we do know a limited number of positive labels. III) The set of positive labels grows as time progresses. IV) At time t_e, all labels of the online phase are known. At time t we can only use the offline data (I) and the positive and unlabelled examples (II). [Legend: negative, unknown and positive examples; known at t vs. unknown at t. Timeline 0 -> t -> t_e over phases I (offline training), II (only + labels), III (more + labels), IV (all labels).]

This assumption on the data is however not realistic, since changes in the data distribution such as concept drift can reduce the quality of the model. A common technique is to retrain or update the model at a later point in time, when more recent labelled data is available [1], potentially also updating the prediction threshold. This training is referred to as online training, where new labelled examples are received and can be used to update the model and/or prediction threshold. Such techniques assume that the label latency, that is, the time between seeing an example and its true label, is small. Under these conditions prediction systems can react to concept drift as it occurs. This assumption may not be valid in all real-world scenarios, for example because it is computationally or labour intensive to do so [2]. If the latency becomes too high, models can no longer properly be corrected for concept drift, or cannot be updated at all if the latency is infinite [3].

In this paper we consider a special class of the latter, where we know for some examples that they are positive and do not know the label of the other examples (they may be either positive or negative). This is sketched in Figure 1. At time t, we have the (potentially outdated) offline training data. We have further seen new examples, of which we know some are positive. It is however not until time t_e that we know the actual labels of all examples, and up to that point we only receive the actual label of positive examples. We refer to this as semi-deferred labelling, as we have some labelled examples, but only from one class. This still does not allow us to update or retrain the model, nor does it allow us to set a new threshold that optimizes our target metric. Setting this new threshold is important, since we still want to make a prediction for all unlabelled examples. We therefore introduce OPUS, Online Prediction threshold optimization Under Semi-deferred labelling, which aims to set a better threshold for an existing prediction model, based on the limited available positive examples. Note that this is not the same as imbalanced training, as we have only a single class of up-to-date data and do not make assumptions about the possibly outdated data.
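As a small illustration of the setting, the following Python sketch (our own; the data representation is hypothetical and not taken from the paper) derives the known-positive and unlabelled sets at time t from the times at which positive labels arrive:

```python
import pandas as pd

def semi_deferred_labels(positive_since: pd.Series, t: float):
    """Split examples into known positives and unlabelled ones at time t.
    positive_since maps each example id to the time its positive label
    became known, with NaN for examples that are not (yet) positive."""
    known_positive = positive_since[positive_since <= t].index
    unlabelled = positive_since[~(positive_since <= t)].index  # NaN or > t
    return known_positive, unlabelled
```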
Even though OPUS can be generalized to other scenarios, we discuss a supermarket one in this paper. Shoppers in this supermarket make use of loyalty cards which identify them at each purchase they make. Once a year, the supermarket holds a so-called loyalty campaign [4]. During the campaign, shoppers may collect stamps. These stamps can be collected by spend (for example one stamp for every 10 monetary units) or in special promotions (an extra stamp for a certain product). If a shopper has collected enough stamps they can purchase a reward. Rewards are usually luxurious products available at a much lower (sometimes only symbolic) price. Shoppers are as such persuaded to participate in the loyalty campaign. In this work we consider campaigns with a limited time scope: shoppers can only collect the stamps and redeem rewards within the duration of the campaign. For such a campaign to work, the supermarket must accurately predict how many rewards are required, such that each participating shopper can actually get their reward. Having too few rewards means disappointing shoppers; having too many means there will be unusable stock left. It is beneficial to make predictions about what shoppers will do during the rest of the campaign, in an effort to either steer their decisions or adapt the available rewards later on.

For our prediction model, we are interested in whether a shopper will make a redemption: shoppers have positive labels if they redeem a reward during the campaign, or negative labels otherwise. Consider two campaigns, c_1 and c_2. During c_2 we want to predict whether a shopper will redeem, based on a classifier we learned from the finished campaign c_1. We continuously receive new data and make new predictions at the end of every week. In practice this means we get new positive examples in batches: these are all consumers that first redeemed in the preceding week. Let t be the moment of prediction, since the start of c_2. We train a classifier and select the best threshold. To do so, we split the data of campaign c_1 into a train and validation set, train several models on the former, and adjust the threshold for each model to maximize a target metric using the latter. One of these models has the highest metric value and we select that as our prediction model. We retrain that model on all shoppers of campaign c_1, and test it on campaign c_2, using the corresponding learned optimal threshold. The problem with this approach is that the model trained on campaign c_1 might not work as well on campaign c_2. We also do not get new negative labels during c_2, which means that updating our prediction model or creating a new one is not feasible. The best we can do is to adapt the prediction threshold based on the limited labels we have available for c_2. This is the main contribution of this paper: OPUS is a framework that adapts this threshold, based on the new labelled data.

The remainder of this paper is structured as follows. In Section 2 we discuss existing solutions to the semi-deferred prediction problem. In Section 3 we formalize the prediction problem for loyalty campaign participation. In Section 4 we then discuss OPUS. We finally evaluate OPUS against alternative methods in Section 5 and conclude the paper in Section 6.

2. Related Work

For this problem there are roughly three existing categories of solutions. These are schematically presented in Figure 2. The first is retraining the classifier with an extended training set, where the new positive examples are added. This can either have limited effect if the number of positive labels is small, or it can heavily bias the prediction model towards the positive class if the number of positive labels is high. The second category builds a model based on the positive and unlabelled state of the examples. The third is to ignore the positive labels and assume all new data is unlabelled. All these techniques make a new prediction model or update the existing one. We discuss the second and third categories below, including their specific disadvantages in the semi-deferred labelling problem.

Figure 2: Various semi-deferred labelling solutions and what data they use. 1) [Retrain] the model with the positive examples added to the existing dataset. 2) [PU-learning] builds a model based on the positive and unlabelled state of the examples. 3) [Ignore] the positive labels, only using the examples themselves. All three categories ignore part of the data, while OPUS uses all available data.

2.1. PU-learning

The original PU-learning solution was proposed in [5]. Here a classifier is built using positive and unlabelled examples, where the class of the unlabelled examples is not known. Semi-deferred labelling is less restrictive in two ways. The first is that we do have a, potentially outdated, set of labelled data. For the supermarkets, this is a previous loyalty campaign. The second is that we do not have a single set of positive examples but rather receive additional positive examples later on. Because semi-deferred labelling is less restrictive, PU-learning can still be applied to our problem at different points in time, though it might not be as effective because it ignores part of the data.

The solution of [5] works as follows. Given a set of examples, of which only a fraction is labelled, and only from a single class (positive), the authors reason that the conditional probability that an example belongs to the positive class is equal to the conditional probability that the example is labelled, divided by the probability that a positive sample is labelled. From this reasoning, one can construct a classifier on whether a sample is labelled and use that to estimate the probability that positive samples are labelled (as the average of the classifier outcome for the labelled examples). For the prediction of the unlabelled examples, the probability given by the classifier is divided by this latter value. An extension of this is discussed in [6], where some of the negative effects of misclassification are discussed. PU-learning usually assumes that the examples are labelled completely at random. In [7] this assumption is abandoned, and the probability of an example being labelled depends on its attributes.
This is closer to our specific scenario, as consumers that are labelled as positive earlier in the loyalty campaign may have different characteristics than consumers labelled later in the loyalty campaign. In [8], the assumption that all labels are accurate is dropped: the authors propose a robust method of an ensemble of Support Vector Machines (SVMs) that can deal with the impurity of positive examples. For loyalty campaign participation prediction this is not applicable, as we have an exact definition of positive labels. In addition to these works, a more general overview of PU-learning is discussed in [9].
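For illustration, a minimal sketch of the estimator of [5], assuming scikit-learn. The logistic regression base classifier is our choice, and estimating the labelling probability on the training data itself is a simplification; [5] uses a held-out validation set for this step.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def elkan_noto_probabilities(X: np.ndarray, s: np.ndarray) -> np.ndarray:
    """Estimate P(positive | x) from positive-unlabelled data following [5].
    s[i] = 1 if example i is labelled (and hence positive), 0 if unlabelled."""
    g = LogisticRegression(max_iter=1000).fit(X, s)  # models P(labelled | x)
    c = g.predict_proba(X[s == 1])[:, 1].mean()      # ~ P(labelled | positive)
    return np.clip(g.predict_proba(X)[:, 1] / c, 0.0, 1.0)
```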
What all these techniques have in common is that existing prediction models are not used or changed. While this does not invalidate their use in semi-deferred labelling, their more restrictive assumption does not allow them to benefit from the existing data.

2.2. Ignoring new labels

A more restrictive assumption on the labels is that no labels are available, short of an initial training set. This would be comparable to ignoring newly converted consumers in the second loyalty campaign. Research dealing with such scenarios is often done in a streaming setting, either by predicting labels based on clustering techniques, or by using semi-supervised learning methods. In [3], the authors propose a framework called SCARGC. New examples are first classified by an existing classifier. After enough new examples are seen, new clusters are formed. The clusters are matched to existing clusters by their medoids and as such given a (new) label. Points in the clusters are then used to create a new prediction model. A similar approach using fuzzy clustering [10] is made in [11]. Instead of making predictions using an existing classifier, the COMPOSE framework discussed in [12] uses a semi-supervised approach. Assuming an initial set of labels, new unlabelled examples are classified in a semi-supervised manner. Next, only a selection of these examples is kept for the next timestep, at which new unlabelled examples are again classified using the semi-supervised learner. The main contribution of COMPOSE is the way in which the kept examples are selected: using alpha-shapes which are compacted until a desired number of examples remain.

The commonality in these approaches is that they focus on a streaming setting. At specific points in time, one or more existing models are updated using estimated labels from the respective solution. Although the authors show the effectiveness in dealing with streaming data by seeing many new examples at different timesteps, there are three key differences that make these methods unsuitable for our setting. First, we do not have a constant stream of new examples; instead we see new information about existing examples at discrete points in time. Second, we are only interested in improving a single model once, as we evaluate the existing model several weeks into the loyalty campaign. Third, we do not expect the drift to be gradual; we are starting a whole new loyalty campaign a year later, so our expected type of drift is more of the abrupt kind. If we were to apply SCARGC or fuzzy clustering to predict loyalty campaign participation, it would effectively be a more elaborate kNN classifier. COMPOSE is not applicable at all, since there is no way of knowing the true label of non-participating consumers until the end of the loyalty campaign.

Another solution to the unavailability of labels is active learning, such as in [13]. Instead of having no labels, learners can query selected labels from domain experts. An example of dealing with such conditions in a multiclass setting is COCEL [2]. An ensemble of one-class classifiers is kept, each predicting only whether an example belongs to its respective class. If none of the classifiers recognizes the new example as one of their classes, the example is added to a buffer. Clusters in this buffer are labelled by domain experts. Similar to COMPOSE, such solutions are not applicable in our scenario, as we cannot know the actual label until the end of the loyalty campaign.

3. Prediction Task

Before we explain how OPUS works, we first discuss the prediction task itself. All symbols used are summarized in Table 1. As introduced in Section 1, we want to solve the problem: "Given the data related to a consumer at time t during the loyalty campaign, can we predict whether they will participate in the loyalty campaign?". This is a problem of binary classification: for each consumer we want to determine whether they are a member of one of two classes: the positive class (participant) or the negative class (non-participant). One way to do this is to use a classifier with a prediction threshold. A classifier is a function that assigns a probability to the encoding of a consumer. If the probability exceeds the prediction threshold, the consumer is predicted to belong to the positive class; otherwise they are predicted to belong to the negative class. While classifiers can usually be trained to make a prediction between more than two classes, we only consider binary classifiers in this paper which, without loss of generality, predict the probability that a consumer is part of the positive class.

Table 1: Symbols used for the prediction task

| Symbol | Meaning |
|---|---|
| t | Time since the start of the loyalty campaign for which we train a model / make a prediction |
| c_i | Loyalty campaign; we make predictions during c_2 using a model trained on c_1 |
| S_i | Set of consumers that visited during c_i |
| PS_i | Subset of S_i that redeemed during c_i |
| NS_i | Subset of S_i that did not redeem during c_i |
| M_i^t | Binary classification model trained on c_i |
| th | Threshold to assign a label based on the probability predicted by a binary classification model |
| PS_i^t | Subset of S_i that redeemed by time t in c_i |
| US_i^t | Subset of S_i that has not (yet) redeemed by time t |
| oth_{i,j}^t | Optimal threshold based on probabilities computed by M_i^t for shoppers in c_j |
| metric_{i,j}^t | Metric value computed using oth_{i,j}^t and probabilities computed by M_i^t for shoppers in c_j |

Let M : S -> [0, 1] be a function with S the set of all consumers. M is a binary classifier that estimates the probability that a consumer s belongs to the positive class. For a given prediction threshold th in [0, 1], consumer s is predicted to belong to the positive class if and only if M(s) > th. The binary classifier and the prediction threshold together assign a predicted binary class; this is the binary classification of the consumer. In this paper, the focus is on determining the threshold. However, before we do so, we explain what data is used to train the classifier for the prediction task. This is presented schematically in Figure 3.
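As a minimal sketch (our notation; the paper's implementation may differ), the prediction rule for a given threshold is:

```python
import numpy as np

def binary_classification(probabilities: np.ndarray, th: float) -> np.ndarray:
    """Section 3 prediction rule: predict positive if and only if M(s) > th."""
    return (probabilities > th).astype(int)
```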
Figure 3: Overview of learning the model and optimal threshold in the first loyalty campaign. Since the first loyalty campaign has ended, we know the labels and we can split the consumers in a positive set PS and a negative set NS.

3.1. Training phase: the first loyalty campaign

We make a prediction at time t relative to the start of the loyalty campaign, which ends at t_e. We start by learning a classifier from a previous loyalty campaign. For c_1 we know which class each consumer that visited the store during c_1 belongs to, that is, whether they redeemed a reward (PS_1) or not (NS_1). The 1 in the subscript indicates that these consumer sets belong to loyalty campaign c_1. We select a stratified sample of 75% from this set to train a model and later use the remaining 25% to optimize the threshold. Many different types of classifiers exist, but in general the training of a classifier involves finding a function that maximizes the predicted probability for consumers that belong to the positive class and minimizes it for consumers that belong to the negative class.

The classifier that is trained on data up to time t is referred to as M_1^t. The reason is that we want to use it at time t in c_2, so only data up to that time in c_1 should be used to train the classifier, to make the encoding of consumers as similar as possible. This is indicated by the superscript t. After training a classifier, we select a prediction threshold. If we set the threshold lower, more consumers will be predicted as participants; if we set the threshold higher, fewer consumers will be predicted as participants. For a given binary classification metric and a set of consumers, there is a threshold that can be considered the best, as it results in the best metric score. We call this the optimal threshold, or oth_{1,1}^t. The consumers used to optimize the threshold come from the 25% that was not used in training. The double 1 subscript indicates that it is the optimal threshold for the model trained on c_1 and optimized using data from c_1. The value of the metric for this optimal threshold is denoted by metric_{1,1}^t.

Instead of training just one model and just one way of encoding, we can also train multiple different model types and different ways of encoding. We repeat the model training and threshold optimization for each combination of model type and encoding type. Each combination then results in a value for the metric, and we select the combination with the highest value. This process is referred to as hyper-parameter optimization. We discuss the exact model types in Section 5; the different encoding types are beyond the scope of this paper. After selecting the best combination of model type and encoding type, we train a final model, which now includes all consumers from c_1, and keep the previously computed threshold of that combination. With this new model and the optimal threshold, we can make predictions in the following loyalty campaign, c_2.

3.2. Inference phase: the second loyalty campaign

With the trained model, we can start making predictions during the second loyalty campaign. At point t in c_2 we only have data available up to point t, so that is what we use in the encoding of consumers. We can identify two sets of consumers. The first are those who have redeemed and have a positive label: PS_2^t. The second are those who have not (yet) redeemed and may or may not still do so: US_2^t. Both are indexed by t since, as explained in Section 3, the sets change over time. Note that PS_2^t is a subset of PS_2, NS_2 is a subset of US_2^t, and the union of PS_2^t and US_2^t equals S_2. Since we cannot distinguish between the latter two, we make predictions for them using M_1^t, and we assign the predicted labels using the learned oth_{1,1}^t. With the predicted and actual labels (which are available when c_2 ends), we can then compute the classification metric. This is visually presented in Figure 4 in the white area. As discussed previously, we are not able to update the prediction model M_1^t itself. However, we can change the threshold we use to assign the predicted class from the predicted probabilities, which is the topic of Section 4.
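A minimal sketch of this threshold selection on the 25% validation split, assuming scikit-learn and the exhaustive sweep over predicted probabilities described in Section 4.4, with balanced accuracy as an example metric:

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score

def optimal_threshold(y_true, y_prob, metric=balanced_accuracy_score):
    """Evaluate every predicted probability as a candidate threshold and
    return the best one (oth) together with its metric value (metric)."""
    best_th, best_score = 0.5, -np.inf
    for th in np.unique(y_prob):
        score = metric(y_true, (y_prob > th).astype(int))
        if score > best_score:
            best_th, best_score = th, score
    return best_th, best_score
```

Sweeping only the observed probabilities suffices, since the predicted labels, and hence the metric value, only change at those values.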
4. OPUS framework

In this section, we sketch the idea behind OPUS and then formalize it. Table 2 summarizes the new symbols. An overview of how OPUS adds to the existing prediction problem is shown in Figure 4.

Table 2: Symbols used by OPUS

| Symbol | Meaning |
|---|---|
| acp_{i,j}^t | The average probability of PS_j^t predicted by M_i^t |
| ancp_{i,j}^t | The average probability of US_j^t predicted by M_i^t |
| TPM_{i,j} | Threshold prediction model trained on loyalty campaigns c_i and c_j |

Figure 4: The original prediction task (white area with solid bounds) and how OPUS extends it (gray area with dashed bounds). The set of converted consumers PS is now used to tell something about the model.

We start from PS_2^t and US_2^t. As we do not know the labels of the latter, we cannot use these consumers to update M_1^t. We can however compute two characteristics about the current loyalty campaign c_2: the average probability computed by M_1^t for each of them. We refer to these as the average converted probability acp_{1,2}^t and the average non-converted probability ancp_{1,2}^t. The subscript 1 indicates that M_1^t is used for the model, the subscript 2 that PS_2^t and US_2^t are used for the consumers. We can do the same for c_1, using PS_1^t and US_1^t with M_1^t to compute acp_{1,1}^t and ancp_{1,1}^t respectively. In general, for loyalty campaigns c_i, c_j and time t we have acp_{i,j}^t = E_{s in PS_j^t}[M_i^t(s)] and ancp_{i,j}^t = E_{s in US_j^t}[M_i^t(s)]. Note that these values do not necessarily say something about the model quality. On the one hand, we expect acp_{1,2}^t to be high, since we know that it evaluates only positive consumers. On the other hand, we cannot form an expectation for ancp_{1,2}^t, as it evaluates negative and possibly also positive consumers. One thing to note is that for acp_{1,2}^t we only evaluate consumers that have already converted by time t in c_2. It might be that for t early in the loyalty campaign we are including only a very specific subset of eager consumers. Such consumers may show different behaviour from consumers that converted later in the loyalty campaign. As such, acp_{1,2}^t may not be an accurate representation of the average predicted probability of all positive consumers, acp_{1,2}^{t_e}. This is not a problem per se. First, the model is still trained on all consumers; it therefore also captures behaviour from the consumers that convert later. Second, we can compare acp_{1,2}^t to acp_{1,1}^t. As such, both will be biased towards early redeemers if t is small, so we expect to compare the same type of consumers. We can therefore use these values to compare c_1 and c_2 and adapt the prediction threshold.

As we cannot change M_1^t, the best we can do is changing the prediction threshold. The best value is the one which is computed based on the labelled sets of consumers of the second loyalty campaign, PS_2 and NS_2. In line with the previous indexing, this value is oth_{1,2}^t. We can only know this value once c_2 ends, so during the loyalty campaign we need different ways of estimating it. Apart from the Baseline methods, we differentiate between Heuristic thresholds and Learned thresholds. All methods are presented in Table 3.

Table 3: All threshold methods evaluated in this paper

| Type | Name | Description |
|---|---|---|
| Base | Regular | No altered threshold (0.5) |
| Base | oth_{i,i}^t | Optimal threshold from training phase |
| Heuristic | ⊘acp | oth_{i,i}^t multiplied by the ratio of acp |
| Heuristic | ⊘ancp | oth_{i,i}^t multiplied by the ratio of ancp |
| Heuristic | Δacp | oth_{i,i}^t increased by the difference of acp |
| Heuristic | Δancp | oth_{i,i}^t increased by the difference of ancp |
| TPM | DT | TPM_{i,j}, with Decision Tree Regressor |
| TPM | RF | TPM_{i,j}, with Random Forest Regressor |

4.1. Baseline thresholds

For the baselines we consider values that at most use the training data, so we only require data from the first loyalty campaign. Without any alteration, a baseline of 0.5 is often used as standard. We further add the optimal threshold based on the training data, oth_{1,1}^t.

4.2. Heuristic thresholds

For the heuristic thresholds we consider values that also make use of the second loyalty campaign, and which can be used without further knowledge. For this, we consider four linear formulas. The first, ⊘acp, multiplies the optimal threshold from training, oth_{1,1}^t, with the ratio between the average converted probabilities, acp_{1,2}^t / acp_{1,1}^t. The idea behind this method is that if the average probability of the converted consumers has increased (or decreased), then the model likely overestimates (underestimates) the probability of a consumer being positive. Similarly, we can also use the average non-converted probabilities for this, through multiplying oth_{1,1}^t by ancp_{1,2}^t / ancp_{1,1}^t. We refer to this as the ⊘ancp method. Apart from the ratio, we can also take the difference, adding acp_{1,2}^t - acp_{1,1}^t to oth_{1,1}^t, which we refer to as Δacp. The Δancp method is defined in a similar way. Note that in principle any of these four may result in a value above 1, and the difference-based heuristic methods may result in a value below 0. This does not invalidate the threshold, but it means that all entities will be predicted to be negative or positive, respectively.
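A minimal sketch of the four heuristics, assuming a scikit-learn-style consumer prediction model with a predict_proba method (function and variable names are ours):

```python
import numpy as np

def avg_probability(model, consumers) -> float:
    """acp/ancp: mean predicted positive-class probability over a consumer set."""
    return float(np.mean(model.predict_proba(consumers)[:, 1]))

def heuristic_thresholds(oth_11, acp_11, acp_12, ancp_11, ancp_12) -> dict:
    """The four linear updates of Section 4.2, in the notation of Table 3."""
    return {
        "ratio_acp":   oth_11 * acp_12 / acp_11,      # the ⊘acp method
        "ratio_ancp":  oth_11 * ancp_12 / ancp_11,    # the ⊘ancp method
        "delta_acp":   oth_11 + (acp_12 - acp_11),    # the Δacp method
        "delta_ancp":  oth_11 + (ancp_12 - ancp_11),  # the Δancp method
    }
```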
4.3. Learned thresholds

For the learned thresholds we need full information about the second loyalty campaign, meaning we can only apply the learned thresholds to a third loyalty campaign. What the heuristic methods have in common is that they use one or more of the acp, ancp and oth values to make an estimation of the target value (oth_{1,2}^t). For the final set of threshold estimation methods, we train a regression model, which we refer to as the threshold prediction model, or TPM. If needed, we refer to M_1^t as the consumer prediction model to distinguish between the two. TPM takes as descriptive space all known values at t: the (non-)converted probabilities acp_{1,1}^t, acp_{1,2}^t, ancp_{1,1}^t, ancp_{1,2}^t, the optimal training threshold oth_{1,1}^t, the training metric value metric_{1,1}^t, the model specification (type and encoding), and the point in time t. This is represented by the red lines in Figure 4. As target space, we have the value of oth_{1,2}^t. We can sample these values for different values of t and for each of the different model specifications. Using two loyalty campaigns, we create a training set for TPM, with the target values coming from the second loyalty campaign and the descriptive features from both loyalty campaigns. We denote this TPM as TPM_{1,2}, without the superscript t, as multiple values of t are used to train TPM_{1,2}. Note that we cannot use TPM_{1,2} to estimate a threshold to use during c_2, as c_2 needs to be finished before TPM_{1,2} can be learned. OPUS does not depend on the specific regressor used as a base model for TPM; in Section 5 we evaluate Decision Tree and Random Forest regressors.

The idea behind the learned thresholds is that TPM_{1,2} learns how to adapt the prediction method (model and threshold) from one loyalty campaign to another. This means we could learn how the prediction method is best "transformed" from c_1 to c_2, and then apply this learned transformation. Suppose that c_1 rewards glassware, c_2 rewards pans, and a third c_3 rewards kitchen knives. If we learn a transformation from glassware to pans, then we can assess 1) how well it works when transforming the prediction method for a different set of consumers from glassware to pans, and 2) how well that transformation applies to the transformation from pans to kitchen knives. In the evaluation in Section 5 we limit the evaluation to the first, due to the unavailability of additional data on a third loyalty campaign.
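A minimal sketch of training TPM_{1,2}, assuming the sampled values are collected in a pandas DataFrame with one row per (model specification, timestep) and illustrative column names that are ours, not the paper's:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def fit_tpm(rows: pd.DataFrame) -> RandomForestRegressor:
    """Fit a threshold prediction model on two finished loyalty campaigns.
    Descriptive space: acp/ancp values, training threshold and metric,
    model specification and timestep; target: the optimal threshold oth_12."""
    X = pd.get_dummies(
        rows[["acp_11", "acp_12", "ancp_11", "ancp_12",
              "oth_11", "metric_11", "model_spec", "t"]],
        columns=["model_spec"])  # one-hot encode the categorical specification
    y = rows["oth_12"]           # known once c_2 has ended
    return RandomForestRegressor(random_state=0).fit(X, y)
```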
4.4. Existing threshold adaptation techniques

OPUS is not the first framework that optimizes thresholds. In GHOST [14], the optimal threshold of any classifier is optimized on a validation set. The metric value for several thresholds is computed and averaged over several stratified subsets of a validation set. While this technique may prove effective for finding an optimal threshold for the original training and validation data, there is no update in a later stage, and such an update would require labelled data. As such, GHOST is not suitable for the problem at hand. Similarly to GHOST, [15] proposes a method to find the optimal threshold in a training set. The authors prove that for such a method only the probabilities of the positive class are required when optimizing for F1. While this technique potentially saves computation time, it is not universally applicable to all metrics. In this paper we optimize the thresholds by evaluating all predicted probabilities, and selecting the one that optimizes the target metric. Finally, [16] proposes a technique to adapt the separating hyperplane position of SVMs to improve their predictive quality. OPUS is different from these because it tries to optimize the threshold using the new, partly labelled data in an online fashion. OPUS is designed to improve the predictions under concept drift.
5. Experimental Evaluation

For the experimental evaluation (the implementation is available at github.com/YorickSpenrath/opus) we use data from a real-world retailer containing a total of about 160 000 consumers, shopping in exactly one of two consecutive loyalty campaigns, c_1 and c_2. We partition the consumers into ten separate datasets d_1 through d_10. For each dataset we build several prediction models based on a grid search over hyper-parameters and at different timesteps relative to the start of the loyalty campaign.

Models are trained and tested at multiple points in time relative to the loyalty campaign, after 2, 4, 6, ..., 18 weeks into the campaign, ending before t_e. The encoding of consumers is based on the data available from the pre-loyalty-campaign period and the data during the loyalty campaign up to t. We consider three hyper-parameters: the base model, the encoding technique, and a strategy. The base models are the consumer prediction models, for which we use k-nearest neighbours (kNN), decision tree classifiers (DTs), support vector classifiers (SVCs) and Adaptive Hoeffding Option Trees (AHOTs), all with their default parameters in Scikit-Learn for kNN, DTs and SVCs, and Scikit-Multiflow [17] for AHOTs, except for setting a constant seed and training the SVCs on probabilities instead of labels. The encoding techniques define what type of features we use about a consumer. The features are based on individual visits to the store, using aggregates based on [18] and descriptive labels based on [19]. We finally adopt two strategies: with and without adding the converted consumers in the test set to the training set. In either strategy, the prediction quality metrics are only computed on the non-converted consumers in the test set. For each timestep we select the hyper-parameter combination with the best performance during the training phase using oth_{1,1}^t, and report the results for all threshold estimation methods with that combination.

For each combination of hyper-parameters we compute acp_{1,j}^t, ancp_{1,j}^t, and oth_{1,j}^t for j in {1, 2} and each timestep t. On the one hand we can use these to compute all threshold estimation methods discussed in Table 3. On the other hand we can use all combinations to train threshold prediction models. For each dataset d_x we compute two TPMs, one using a Decision Tree regressor, DT_x, and one using a Random Forest regressor, RF_x. These TPMs are then used in every other dataset d_y, y != x, to estimate optimal thresholds. We also add two results for comparison. The first uses the actual optimal threshold (oth_{1,2}^t) and the second uses a prediction model learned on 75% of the data of loyalty campaign c_2 and tested on the remaining 25% (Retrain). Both these methods violate the train-then-test principle, but are added as a comparison of how close we can get.

During the training phase, models are trained on a stratified 75% split of the consumers that visited the store during loyalty campaign c_1; the remaining 25% are used as test set. During the inference phase, models are trained on all consumers that visited the store during loyalty campaign c_1, and the consumers that visited the store by time t in c_2 are used as test set. The entire test set is used to compute oth_{1,j}^t using the model trained on the training set. The converted (non-converted) consumers by time t in the test set are used to compute acp_{1,j}^t (ancp_{1,j}^t). The non-converted consumers by time t are used to compute the model performance for a given threshold estimation method.

Next to these methods, we also apply PU-learning as defined by [5]. For this we use the same hyper-parameters as for the other experiments. We first find the best combination based on loyalty campaign c_1 and then report the score for that hyper-parameter combination in c_2, computed over all non-converted consumers at the respective point in time.

We apply the above for both the F1 and accuracy metrics. We present two types of results: those for timestep t = 4, separated per dataset, and those for all timesteps, aggregated over the datasets. The reason for separately reporting t = 4 comes from domain knowledge: this is the point in the loyalty campaign where additional rewards for the loyalty campaign can still be ordered by the supermarket.
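For illustration, the base-model part of this grid could be instantiated as follows (a sketch with scikit-learn defaults; the AHOT from Scikit-Multiflow [17], the encoding techniques and the strategies are omitted, and all names are ours):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

TIMESTEPS = range(2, 19, 2)  # weeks 2, 4, ..., 18 into the campaign

BASE_MODELS = {
    "kNN": KNeighborsClassifier(),
    "DT": DecisionTreeClassifier(random_state=0),  # constant seed
    "SVC": SVC(probability=True, random_state=0),  # probability outputs enabled
}
```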
5.1. Results for t = 4

Results are presented per split (one per row) and threshold estimation method (one per column) in Figure 5. We do not report the values of the combinations where a TPM is learned from the same split as the data it is tested on (the white diagonals).

Baseline methods. For the baseline methods we see that using the regular value is not much better than always predicting True or False, with almost all scores being 0.50, which is effectively the worst value one can get with balanced accuracy in a binary classification. Using the optimal training threshold oth_{i,i}^t works much better, with the exception of d_5, d_6 and d_7.
[Figure 5 shows a heatmap with one row per split (c0 through c9) and one column per threshold estimation method: Regular, oth_{i,i}^t, the four heuristics, the TPMs DT0-DT9 and RF0-RF9, oth_{i,j}^t, Retrain and PU. The cells contain balanced accuracy values, roughly between .50 and .88.]

Figure 5: Accuracy at timestep 4. Values shown in bold are strictly better than the oth_{i,i}^t, the ones in italic are strictly smaller. Higher values correspond to a brighter colour. The white diagonals are skipped, since there the TPM would be trained on the same split we test it on. Some scores for PU-learning are left out; for these splits there were not enough converted consumers to train a meaningful prediction model.
Heuristic methods. For the heuristic methods we have that, except for (d_8, Δancp), the ancp methods consistently perform as well as or better than the baseline oth_{i,i}^t. The acp values do much worse however, especially for Δacp. As explained in Section 4.2, if the difference between acp_{i,j}^t and acp_{i,i}^t is too high, this method might end up predicting all shoppers as redeemers or all shoppers as non-redeemers. This is what happens for splits 0, 2 and 8.

Learned methods. With only a few exceptions, the learned methods always perform at least as well as the baseline and often beat it. Furthermore, for most splits the best values even come close to the results for oth_{i,j}^t, the value which we aim to estimate. Next to this, we see that the Random Forest TPM performs almost always as well as or better than the Decision Tree based TPM, compared on the same combination of splits. This is to be expected given the complexity of the Random Forest model, which can as such capture more difficult relations.

PU-Learning. One weakness of PU-learning is if the number of known positive labels is small. This results in a base prediction model that cannot predict positive points, resulting in an average probability of 0 for the known positive labels. In such a scenario, PU-learning fails to produce any result. This is what happens in 4 of the splits. Note that OPUS does not have this problem, as we are using the labelled data from the last loyalty campaign. In the splits where there are enough known positive labels, PU-learning is still outperformed by OPUS, for a similar reason.

5.2. Results aggregated per timestep

We next aggregate the results over the splits for each timestep in Figures 6a and 6b. We report the mean and 95% confidence interval of the metric for every timestep (each row). For most estimation methods (each column) this is the average over the 10 values from the different splits; for the learned estimation methods we consider all combinations. In other words, the reported aggregated values for DT and RF are taken over all 90 combinations.

Accuracy. Some results from Section 5.1 can be transferred to more timesteps. The learned thresholds seem to perform better than using oth_{i,i}^t. Note that, as we have averaged over both the splits used for the evaluation as well as the splits used for computing the TPM, this result describes the benefit of using the learned method in OPUS in general: using any other dataset to train the TPM to estimate the optimal threshold is, on average, beneficial. What is more, mainly for the Random Forest TPM, we see that we get values that are close to using the actual optimal threshold oth_{i,j}^t. For the heuristic thresholds, we see that the ratio-based ones (⊘) perform close to or better than the oth_{i,i}^t baseline. For PU-learning we see that the results improve over time as more positive labels are known. This is to be expected, though it does not yet catch up with the OPUS methods.

F1. We see something similar to the accuracy results. This is expected, as we are penalizing in a similar way by comparing the actual with the predicted labels for each shopper. There are two important differences. The first is that we see lower values. This is because F1 does not account for class imbalance as we did for the balanced accuracy score. Given that the redeemer class is much smaller than the non-redeemer class, the reported values for oth_{i,j}^t and Retrain are reasonable. This is also seen by the non-oth_{i,i}^t baselines, which have a much lower score. The second is that we are further off from the optimal thresholds oth_{i,j}^t. This is likely because of a similar reason: the F1 is much more penalizing for imbalanced datasets, and as such, being able to use the optimal thresholds/all labels allows for a much better score than estimating it using any of the methods from OPUS. Like accuracy, the results of PU-learning improve over time, eventually even improving over the OPUS methods. The lower accuracy and higher F1 for PU-learning is likely caused by the unbalanced dataset as well.
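The paper does not state how the interval is computed; a standard Student-t interval over the per-split scores, as a sketch assuming SciPy, could look as follows:

```python
import numpy as np
from scipy import stats

def mean_ci(scores, confidence=0.95):
    """Mean and half-width of the confidence interval over splits,
    as reported per timestep and method in Figure 6."""
    scores = np.asarray(scores, dtype=float)
    half = stats.sem(scores) * stats.t.ppf((1 + confidence) / 2, len(scores) - 1)
    return scores.mean(), half
```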
(a) Accuracy

| t | Regular | oth_{i,i}^t | ⊘acp | ⊘ancp | Δacp | Δancp | DT | RF | oth_{i,j}^t | Retrain | PU |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | .50 ±.01 | .77 ±.04 | .79 ±.04 | .80 ±.03 | .67 ±.08 | .74 ±.06 | .81 ±.01 | .82 ±.01 | .85 ±.01 | .86 ±.01 | .66 ±.10 |
| 8 | .53 ±.02 | .80 ±.03 | .80 ±.03 | .80 ±.04 | .65 ±.10 | .78 ±.06 | .81 ±.01 | .82 ±.01 | .84 ±.02 | .88 ±.01 | .73 ±.03 |
| 12 | .53 ±.03 | .78 ±.04 | .76 ±.05 | .76 ±.05 | .68 ±.08 | .79 ±.04 | .78 ±.02 | .80 ±.02 | .83 ±.03 | .86 ±.01 | .74 ±.03 |
| 16 | .60 ±.04 | .79 ±.06 | .77 ±.07 | .77 ±.05 | .72 ±.09 | .76 ±.07 | .80 ±.01 | .81 ±.01 | .83 ±.03 | .89 ±.01 | .73 ±.04 |

(b) F1

| t | Regular | oth_{i,i}^t | ⊘acp | ⊘ancp | Δacp | Δancp | DT | RF | oth_{i,j}^t | Retrain | PU |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | .03 ±.03 | .14 ±.09 | .17 ±.09 | .20 ±.09 | .21 ±.09 | .12 ±.09 | .25 ±.03 | .29 ±.02 | .34 ±.03 | .40 ±.02 | .15 ±.08 |
| 8 | .09 ±.05 | .19 ±.08 | .22 ±.09 | .20 ±.08 | .12 ±.06 | .20 ±.08 | .20 ±.03 | .22 ±.03 | .33 ±.04 | .39 ±.03 | .34 ±.03 |
| 12 | .06 ±.05 | .21 ±.06 | .18 ±.07 | .19 ±.07 | .15 ±.06 | .19 ±.07 | .20 ±.02 | .20 ±.02 | .25 ±.05 | .35 ±.02 | .32 ±.01 |
| 16 | .07 ±.04 | .11 ±.05 | .10 ±.05 | .09 ±.05 | .11 ±.05 | .11 ±.05 | .15 ±.01 | .17 ±.01 | .20 ±.02 | .31 ±.03 | .25 ±.01 |

Figure 6: a) Accuracy and b) F1 at each timestep, averaged over the splits with their 95% confidence interval. Values shown in bold are strictly better than the oth_{i,i}^t, the ones in italic are strictly worse (in terms of the reported mean). Better values correspond to a brighter colour.
6. Conclusion

In this paper we have presented OPUS, a framework to estimate a better prediction threshold when only one class of the target labels is available. The advantage of such optimizations is that we can adapt existing prediction models without having to rely on fully labelled data. This is a more realistic assumption when the true label of one class is deferred. OPUS makes use of two types of threshold estimation methods, where learned methods are more effective in getting to an optimal prediction threshold at the cost of requiring more data (only being available at a third campaign), compared to heuristic methods that do not have these benefits or limitations. While the initial results are promising, we want to extend the experimental analysis, both by including more than two loyalty campaigns and by applying OPUS on a different use case inside the student learning domain. In the design of OPUS, only characteristics that belong to the current timestep are considered, not those of all previous timesteps. We argue that the progression of these characteristics, as well as the model performance over time, and how they compare to the same period in the previous loyalty campaign, can be interesting for improving the framework in future work.

References

[1] W. Rizzi, C. Di Francescomarino, C. Ghidini, F. M. Maggi, How do I update my model? On the resilience of Predictive Process Monitoring models to change, KAIS 64 (2022) 1385-1416.
[2] C. Fahy, S. Yang, M. Gongora, Classification in Dynamic Data Streams with a Scarcity of Labels, IEEE TKDE (2021).
[3] V. M. A. Souza, D. F. Silva, J. Gama, G. E. A. P. A. Batista, Data Stream Classification Guided by Clustering on Nonstationary Environments and Extreme Verification Latency, in: SDM, 2015, pp. 873-881.
[4] N. J. Bombaij, S. Gelper, M. G. Dekimpe, Designing successful temporary loyalty programs: An exploratory study on retailer and country differences, International Journal of Research in Marketing 39 (2022) 1275-1295.
[5] C. Elkan, K. Noto, Learning classifiers from only positive and unlabeled data, KDD (2008) 213-220.
[6] J. Peeperkorn, C. O. Vázquez, A. Stevens, J. D. Smedt, S. Broucke, J. D. Weerdt, Outcome-Oriented Predictive Process Monitoring on Positive and Unlabelled Event Logs, ICPM Workshops (2022).
[7] J. Bekker, P. Robberechts, J. Davis, Beyond the Selected Completely at Random Assumption for Learning from Positive and Unlabeled Data, LNCS 11907 (2020) 71-85.
[8] M. Claesen, F. De Smet, J. A. Suykens, B. De Moor, A robust ensemble approach to learn from positive and unlabeled data using SVM base models, Neurocomputing 160 (2015) 73-84.
[9] J. Bekker, J. Davis, Learning from positive and unlabeled data: a survey, Machine Learning 109 (2020) 719-760.
[10] P. De Abreu Lopes, H. De Arruda Camargo, FuzzStream: Fuzzy data stream clustering based on the online-offline framework, IEEE International Conference on Fuzzy Systems (2017).
[11] T. P. da Silva, V. M. A. Souza, G. E. A. P. A. Batista, H. de Arruda Camargo, A fuzzy classifier for data streams with infinitely delayed labels, LNCS 11401 (2019) 287-295.
[12] K. B. Dyer, R. Capo, R. Polikar, COMPOSE: A semisupervised learning framework for initially labeled nonstationary streaming data, IEEE Transactions on Neural Networks and Learning Systems 25 (2014) 12-26.
[13] B. Settles, Active Learning Literature Survey, Technical Report, University of Wisconsin-Madison, 2009.
[14] C. Esposito, G. A. Landrum, N. Schneider, N. Stiefl, S. Riniker, GHOST: Adjusting the Decision Threshold to Handle Imbalanced Data in Machine Learning, Journal of Chemical Information and Modeling 61 (2021) 2623-2640.
[15] Q. Zou, S. Xie, Z. Lin, M. Wu, Y. Ju, Finding the Best Classification Threshold in Imbalanced Classification, Big Data Research 5 (2016) 2-8.
[16] H. Yu, C. Mu, C. Sun, W. Yang, X. Yang, X. Zuo, Support vector machine-based optimized decision threshold adjustment strategy for classifying imbalanced data, Knowledge-Based Systems 76 (2015) 67-78.
[17] J. Montiel, J. Read, A. Bifet, T. Abdessalem, Scikit-Multiflow: A Multi-output Streaming Framework, JMLR 19 (2018) 1-5.
[18] Y. Spenrath, M. Hassani, B. F. van Dongen, Online prediction of aggregated retailer consumer behaviour, ICPM Workshops (2021).
[19] Y. Spenrath, M. Hassani, B. van Dongen, H. Tariq, Learning an efficient distance metric for retailer transaction data, in: PKDD, 2020, pp. 323-338.