Online Prediction Threshold Optimization Under Semi-deferred Labelling Yorick Spenrath1,* , Marwan Hassani1 and Boudewijn F. van Dongen1 1 Process Analytics Group, Faculty of Mathematics and Computer Science, Eindhoven University of Technology, The Netherlands Abstract In supermarket loyalty campaigns, shoppers collect stamps to redeem limited-time luxury products. Having an accurate prediction of which shoppers will eventually redeem is crucial to effective execution. With the ultimate goal of changing shopper behavior, it is important to ensure an adequate number of rewards and to be able to steer promising shoppers into joining the campaign and redeeming a reward. If information from previous campaigns is available, a prediction model can be built to predict the redemption probability, possibly also adapting the prediction threshold to determine predicted the label. During a running campaign, we only know a subset of the labels of the positive class (the so-far redeemers), and have no access to the labels of any example of the negative class (non-redeemers at the end of the campaign). The majority of the examples during the campaign do not have a label yet (shoppers that could still redeem but have not done so yet). This is a semi-deferred labelling setting and our goal is to improve the prediction quality using this limited information. Existing work on predicting (semi-deferred) labels either focuses on positive-unlabelled learning, which does not use existing models, or updates models after the prediction is made by assigning expected labels using unsupervised learning models. In this paper we present a framework for Online Prediction threshold optimization Under Semi-deferred labelling (OPUS). Our framework does not change the existing model, but instead adapts the prediction threshold that decides which probability is required for a positive label, based on the semi-deferred labels we already know. We apply OPUS to two real-world datasets: a supermarket with two campaigns and over 160 000 shoppers. Keywords Prediction threshold, Online, Semi-deferred labels, Supermarket loyalty campaign 1. Introduction Known at t Unknown at t Negative example Unknown example Positive example Traditional supervised machine learning projects start with a set of labelled data. Specifically in binary classification, data belongs to one of two classes: positive and negative. Starting from a set of examples with their ground truth label, I II III IV a predictor is created, which can assess the probability that 0 t te an unseen example belongs to the positive class. Without Offline training Only + labels More + labels All labels optimization, an example is predicted to be part of the posi- Figure 1: I) In an offline phase, all labels are present. II) In the tive class if this probability is at least 0.5. One improvement online phase, this data is potentially outdated, but we do know a is to select a different prediction threshold that distinguishes limited number of positive labels. III) The set of positive labels examples between the positive and negative class on their grows as time progresses. IV) At time ๐‘ก๐‘’ , all labels of the online probability. In the presence of labelled data, such a threshold phase are known. At time ๐‘ก we can only use the offline data (I) and the positive and unlabelled examples (II). can be picked in a way that the predicted labels maximize a given metric. Provided that no concept drift occurs, the model with the adapted threshold can be used indefinitely without degrading prediction quality. In this scenario, we If the latency becomes too high, models can no longer prop- only use the โ€˜offline trainingโ€™ (I) of Figure 1. erly be corrected for concept drift, or not even be updated This assumption on the data is however not realistic since at all if the latency is infinite [3]. changes in the data distribution such as concept drift can In this paper we consider a special class of the latter, reduce the quality of the model. A common technique is to where we know for some examples that they are positive retrain or update the model at a later point in time, when and do not know the label of the other examples (they may more recent labelled data is available [1] and potentially be either positive or negative). This is sketched in Figure 1. also updating the prediction threshold. This training is re- At time ๐‘ก, we have the (potentially outdated) offline training ferred to as online training, where new labelled examples data. We have further seen new examples, of which we are received and can be used to update the model and/or know some are positive. It is however not until time ๐‘ก๐‘’ that prediction threshold. Such techniques assume that the label we know the actual labels of all examples, and up to that latency, that is the time between seeing an example and its point we only receive the actual label of positive examples. true label, is small. Under these conditions prediction sys- We refer to this as semi-deferred labelling, as we have some tems can react to concept drift as it occurs. This assumption labelled examples, but only from one class. This still does may not be valid in all real-world scenarios, for example not allow us to update or retrain the model, neither does because it is computationally or labour intensive to do so [2]. it allow us to set a new threshold that optimizes our target metric. Setting this new threshold is important, since we Published in the Proceedings of the Workshops of the EDBT/ICDT 2024 still want to make a prediction for all unlabelled examples. Joint Conference (March 25-28, 2024), Paestum, Italy We therefore introduce OPUS, Online Prediction threshold * Corresponding author. optimization Under Semi-deferred labelling, which aims to $ y.spenrath@tue.nl (Y. Spenrath); m.hassani@tue.nl (M. Hassani); set a better threshold for an existing prediction model, based b.f.v.dongen@tue.nl (B. F. v. Dongen) on the limited available positive examples. Note that this  0000-0003-0908-9163 (Y. Spenrath); 0000-0002-4027-4351 (M. Hassani); 0000-0002-3978-6464 (B. F. v. Dongen) is not the same as imbalanced training, as we have only a ยฉ 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribu- tion 4.0 International (CC BY 4.0). single class of up-to-date data and do not make assumptions CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings about the possibly outdated data. The first is retraining the classifier with an extended training Even though OPUS can be generalized to other scenarios, set, where the new positive examples are added. This can we discuss a supermarket one in this paper. Shoppers in this either have limited effect if the number of positive labels is supermarket make use of loyalty cards which identify them small or it can heavily bias the prediction model towards the at each purchase they make. Once a year, the supermarket positive class if the number of positive labels is high. The holds a so-called loyalty campaign [4]. During the campaign, second category builds a model based on the positive and shoppers may collect stamps. These stamps can be collected unlabelled state of the examples. The third is to ignore the by spend (for example one stamp for every 10 monetary positive labels, and assume all new data is unlabelled. All units) or in special promotions (an extra stamp for a certain these techniques make a new prediction model or update product). If a shopper has collected enough stamps they can the existing one. We discuss the second and third categories purchase a reward. Rewards are usually luxurious products below, including their specific disadvantages in the semi- available at a much lower (sometimes only symbolic) price. deferred labelling problem. Shoppers are as such persuaded to participate in the loyalty campaign. In this work we consider campaigns with a lim- 2.1. PU-learning ited time scope: shoppers can only collect the stamps and redeem rewards within the duration of the campaign. For The original PU-learning solution was proposed in [5]. Here such a campaign to work, the supermarket must accurately a classifier is build using positive and unlabelled examples, predict how many rewards are required, such that each par- where the class of the unlabelled examples is not known. ticipating shopper can actually get their rewards. Having Semi-deferred labelling is less restrictive in two ways. The too few rewards means disappointing shoppers, having too first is that we do have a, potentially outdated, set of la- many means there will be unusable stock left. It is beneficial belled data. For the supermarkets, this is a previous loyalty to make predictions about what shoppers will do during campaign. The second is that we do not have a single set the rest of the campaign, in an effort to either steer their of positive examples but rather receive additional positive decisions or adapt the available rewards later on. examples later on. Because Semi-deferred labelling is less For our prediction model, we are interested in whether restrictive, PU-learning can still be applied to our problem a shopper will make a redemption: shoppers have positive at different points in time, though it might not be as effective labels if they redeem a reward during the campaign, or because it ignores part of the data. negative labels otherwise. Consider two campaigns, ๐‘1 and A solution to PU-learning was originally proposed in [5]. ๐‘2 . During ๐‘2 we want to predict whether a shopper will Given a set of examples, of which only a fraction is labelled, redeem, based on a classifier we learned from the finished and only from a single class (positive), the authors reason campaign ๐‘1 . We continuously receive new data and make that the conditional probability that an example belongs new predictions and the end of every week. In practice this to the positive class is equal to the conditional probability means we get new positive examples in batches, these are all that the example is labelled, divided by the probability a consumers that first redeemed in the preceding week. Let ๐‘ก positive sample is labelled. From this reasoning, one can be the moment of prediction, since the start of ๐‘2 . We train construct a classifier on whether a sample is labelled and a classifier and select the best threshold. To do so, we split use that to predict the probability that positive samples are the data of campaign ๐‘1 into a train and validation set, and labelled (as average of the classifier outcome for the labelled train several models on the former and adjust the threshold examples). For the prediction of the unlabelled examples, for each model to maximize a target metric using the latter. the probability given by the classifier is divided by this latter One of these models has the highest metric value and we value. An extension of this is discussed in [6], where some select that as our prediction model. We retrain that model of its negative effects of misclassification are discussed. PU- on all shoppers of campaign ๐‘1 , and test it on campaign learning usually assumes that the examples are labelled ๐‘2 , using the corresponding learned optimal threshold. The completely at random. In [7] this assumption is abandoned, problem with this approach is that the model trained on where the probability of an example being labelled depends campaign ๐‘1 might not work as well on campaign ๐‘2 . We also do not get new negative labels during ๐‘2 , which means that updating our prediction model or creating a new one is not feasible. The best we can do is to adapt the prediction threshold based on the limited labels we have available for ๐‘2 . This is the main contribution of this paper, OPUS is a framework that adapts this threshold, based on the new labelled data. The remainder of this paper is structured as follows. In Section 2 we discuss existing solutions to the semi-deferred prediction problem. In Section 3 we formalize the prediction problem for loyalty campaign participation. In Section 4 we then discuss OPUS. We finally evaluate OPUS against alternative methods in Section 5 and conclude the paper in Figure 2: Various Semi-deferred labelling solutions and what Section 6. data they use. 1) [Retrain] the model with the positive examples added to the existing dataset 2) [PU-learning] builds a model based on the positive and unlabelled state of the examples 3) 2. Related Work [Ignore] the positive labels, only using the examples themselves. All three categories ignore part of the data, while OPUS uses all For this problem there are roughly three existing categories available data. of solutions. These are schematically presented in Figure 2. on its attributes. This is closer to our specific scenario, as learners can query selected labels from domain experts. An consumers that are labelled as positive earlier in the loyalty example of dealing with such conditions in a multiclass campaign may have different characteristics than consumers setting is COCEL [2]. An ensemble of one-class classifiers is labelled later in the loyalty campaign. In [8], the assumption kept, each predicting only whether an example belongs to that all labels are accurate is dropped: the authors propose a its respective class. If none of the classifiers recognizes the robust method of an ensemble of Support Vector Machines new example as one of their class, the example is added to a (SVMs) that can deal with the impurity of positive examples. buffer. Clusters in this buffer are labelled by domain experts. For loyalty campaign participation prediction, this is not Similar to COMPOSE, such solutions are not applicable in applicable, we have an exact definition of positive labels. In our scenario, as we cannot know the actual label until the addition to these works, a more general overview of PU- end of the loyalty campaign. learning is discussed in [9]. What all these techniques have in common is that existing predictions models are not used or changed. While this does 3. Prediction Task not invalidate their use in semi-deferred labelling, their more Before we explain how OPUS works, we first discuss the restrictive assumption does not allow them to benefit from prediction task itself. All symbols used are summarized in the existing data. Table 1. As introduced in Section 1, we want to solve the problem: โ€œGiven the data related to a consumer at time ๐‘ก 2.2. Ignoring new labels during the loyalty campaign can we predict whether they A more restrictive assumption on the labels is that no labels will participate in the loyalty campaign?โ€. This is a problem are available, short of an initial training set. This would of binary classification: for each consumer we want to deter- be comparable to ignoring newly converted consumers in mine whether they are a member of one of two classes: the the second loyalty campaign. Research dealing with such positive class (participant) or negative class (no participant). scenarios is often done in a streaming setting, either by pre- One way to do this is to use a classifier with a prediction dicting labels based on clustering techniques, or using semi- threshold. A classifier is a function that assigns a probability supervised learning methods. In [3], the authors propose a to the encoding of a consumer. If the probability exceeds the framework called SCARG. New examples are first classified prediction threshold, the consumer is predicted to belong to by an existing classifier. After enough new examples are the positive class, otherwise they are predicted to belong to seen, new clusters are formed. The clusters are matched the negative class. While classifiers can usually be trained to to existing clusters by their medoids and as such given a make a prediction between more than two classes, we only (new) label. Points in the clusters are then used to create a consider binary classifiers in this paper which, without loss new prediction model. A similar approach using fuzzy clus- of generalization, predict the probability that a consumer is tering [10] is made in [11]. Instead of making predictions part of a positive class. using an existing classifier, the COMPOSE framework dis- Let ๐‘€ : ๐’ฎ โ†’ [0, 1] be a function with ๐’ฎ the set of cussed in [12] uses a semi-supervised approach. Assuming all consumers. ๐‘€ is a binary classifier that estimates the an initial set of labels, new unlabelled examples are classi- probability that a consumer ๐‘  belongs to the positive class. fied in a semi-supervised manner. Next, only a selection For a given prediction threshold ๐‘กโ„Ž โˆˆ [0, 1], consumer ๐‘  of these examples are kept for the next timestep, at which is predicted to belong to the positive class if and only if new unlabelled examples are again classified using the semi- ๐‘€ (๐‘ ) > ๐‘กโ„Ž. The binary classifier and the prediction thresh- supervised learner. The main contribution of COMPOSE old together assign a predicted binary class, this is the binary is the way in which the kept examples are selected: using classification of the consumer. In this paper, the focus is ๐›ผ-shapes which are compacted until a desired number of on determining the threshold. However, before we do so, examples remain. we explain what data is used to train the classifier for the The commonality in these approaches is that they focus prediction task. This is presented schematically in Figure 3. on a streaming setting. At specific points in time, one or more existing models are updated using estimated labels Table 1 from the respective solution. Although the authors show the Symbols used for the prediction task effectiveness in dealing with streaming data by seeing many new examples at different timesteps, there are three key dif- Symbol Meaning ferences that make these methods unsuitable for our setting. ๐‘ก Time since start of loyalty campaign for which First, we do not have a constant stream of new examples, we train a model/make a prediction instead we see new information about existing examples at ๐‘๐‘– Loyalty Campaign, where we make predictions discrete points in time. Second, we are only interested in during ๐‘2 using a model trained on ๐‘1 improving a single model once, as we evaluate the existing ๐’ฎ๐‘– Set of consumers that visited during ๐‘๐‘– model several weeks into the loyalty campaign. Third, we ๐’ซ๐’ฎ ๐‘– Subset of ๐’ฎ๐‘– that redeemed during ๐‘๐‘– do not expect the drift to be gradual; we are starting a whole ๐’ฉ ๐’ฎ๐‘– Subset of ๐’ฎ๐‘– that did not redeem during ๐‘๐‘– new loyalty campaign a year later, so our expected type of ๐‘€๐‘–๐‘ก Binary classification model trained on ๐‘๐‘– drift is more of the abrupt kind. If we were to apply SCARG ๐‘กโ„Ž Threshold to assign a label based on the proba- bility predicted by a binary classification model or fuzzy clustering to predict loyalty campaign participa- ๐’ž๐’ฎ ๐‘ก๐‘– Subset of ๐’ฎ๐‘– that redeemed by time ๐‘ก in ๐‘1 tion it would effectively be a more elaborate ๐‘˜NN classifier. ๐’ฐ ๐’ฎ ๐‘ก๐‘– Subset of ๐’ฎ๐‘– that not (yet) redeemed by time ๐‘ก COMPOSE is not applicable at all, since there is no way of ๐‘œ๐‘กโ„Ž๐‘ก๐‘–,๐‘— Optimal threshold based on probabilities com- knowing the true label of non-participating consumers until puted by ๐‘€๐‘–๐‘ก for shoppers in ๐‘๐‘— the end of the loyalty campaign. ๐‘š๐‘’๐‘ก๐‘Ÿ๐‘–๐‘๐‘ก๐‘–,๐‘— Metric value computed using ๐‘œ๐‘กโ„Ž๐‘ก๐‘–,๐‘— and proba- Another solution to unavailability to unlabelled data is bilities computed by ๐‘€๐‘–๐‘ก for shoppers in ๐‘๐‘— active learning, such as in [13]. Instead of having no labels, Table 2 Symbols used by OPUS Symbol Meaning ๐‘Ž๐‘๐‘๐‘ก๐‘–,๐‘— The average probability of ๐’ž๐’ฎ ๐‘ก๐‘— predicted by ๐‘€๐‘–๐‘ก ๐‘Ž๐‘›๐‘๐‘๐‘ก๐‘–,๐‘— The average probability of ๐’ฐ ๐’ฎ ๐‘ก๐‘— predicted by ๐‘€๐‘–๐‘ก ๐‘‡ ๐‘ƒ ๐‘€๐‘–,๐‘— Threshold prediction model trained on loyalty campaigns ๐‘๐‘– and ๐‘๐‘— encoding type, we train a final model, which now includes all consumers from ๐‘1 , and keep the previously computed threshold of that combination. With this new model and the optimal threshold, we can make predictions in the following Figure 3: Overview of learning the model and optimal threshold loyalty campaign, ๐‘2 . in the first loyalty campaign. Since the first loyalty campaign has ended we know the labels and we can split the consumers in a positive set ๐’ซ๐’ฎ and a negative set ๐’ฉ ๐’ฎ . 3.2. Inference phase: the second loyalty campaign With the trained model, we can start making predictions 3.1. Training phase: the first loyalty during the second loyalty campaign. At point ๐‘ก in ๐‘2 we campaign only have data available up to point ๐‘ก, so that is what we use in the encoding of consumers. We can identify two sets We make a prediction at time ๐‘ก relative to the start of the of consumers. The first are those who have redeemed and loyalty campaign, which ends at ๐‘ก๐‘’ . We start by learning have a positive label: ๐’ž๐’ฎ ๐‘ก2 . The second are those who have a classifier from a previous loyalty campaign. For ๐‘1 we not (yet) redeemed and may or may not still do so: ๐’ฐ ๐’ฎ ๐‘ก2 . know which class each consumer that visited the store dur- Both are indexed by ๐‘ก, as explained in Section 3 the sets ing ๐‘1 belongs to, this is whether they redeemed a reward change over time. Note that ๐’ž๐’ฎ ๐‘ก2 โІ ๐’ซ๐’ฎ 2 , ๐’ฉ ๐’ฎ 2 โІ ๐’ฐ๐’ฎ ๐‘ก2 (๐’ซ๐’ฎ 1 ) or not (๐’ฉ ๐’ฎ 1 ). The 1 in the subscript indicates that and ๐’ž๐’ฎ ๐‘ก2 โˆช ๐’ฐ๐’ฎ ๐‘ก2 = ๐’ฎ2 . Since we cannot distinguish between these consumer sets belong to loyalty campaign ๐‘1 . We se- the latter two, we make predictions for them using ๐‘€1๐‘ก , and lect a stratified sample of 75% from this set to train a model we assign the predicted labels using the learned ๐‘œ๐‘กโ„Ž๐‘ก1,1 . With and later use the remaining 25% to optimize the threshold. the predicted and actual labels (which are available when ๐‘2 Many different types of classifiers exist, but in general the ends), we can then compute the classification metric. This is training of a classifier involves finding a function that maxi- visually presented in Figure 4 in the white area. As discussed mizes the predicted probability for consumers that belong previously, we are not able to update the prediction model to the positive class and minimizing it for consumers that ๐‘€1๐‘ก itself. However, we can change the threshold we use to belong to the negative class. assign the predicted class from the predicted probabilities The classifier that is trained on data up to time ๐‘ก is re- which is the topic of Section 4. ferred to as ๐‘€1๐‘ก . The reason is that we want to use it at time ๐‘ก in ๐‘2 , so only data up to that time in ๐‘1 should be used to train the classifier, to make the encoding of consumers 4. OPUS framework as similar as possible. This is indicated by the superscript ๐‘ก. After training a classifier, we select a prediction threshold. In this section, we sketch the idea behind OPUS and then If we set the threshold lower, more consumers will be pre- formalize it. Table 2 summarizes the new symbols. An dicted as participants, if we set the threshold higher, fewer overview of how OPUS adds to the existing prediction prob- consumers will be predicted as participants. For a given lem is shown in Figure 4. binary classification metric and a set of consumers, there is We start from ๐’ž๐’ฎ ๐‘ก2 and ๐’ฐ๐’ฎ ๐‘ก2 . As we do not know the a threshold that can be considered the best, as it results in labels of the latter, we cannot use these consumers to update the best metric score. We call this the optimal threshold or ๐‘€1๐‘ก . We can however compute two characteristics about ๐‘œ๐‘กโ„Ž๐‘ก1,1 . The consumers used to optimize the threshold come the current loyalty campaign ๐‘2 : the average probability from the 25% that was not used in training. The double 1 computed by ๐‘€1๐‘ก to each of them. We refer to these as subscript indicates that it is the optimal threshold for the the average converted probability ๐‘Ž๐‘๐‘๐‘ก1,2 and the average model trained on ๐‘1 and optimized using data from ๐‘1 . The nonconverted probability ๐‘Ž๐‘›๐‘๐‘๐‘ก1,2 . The subscript 1 indicates value of the metric for this optimal threshold is denoted by that ๐‘€1๐‘ก is used for the model, the subscript 2 that ๐’ž๐’ฎ ๐‘ก2 and ๐‘š๐‘’๐‘ก๐‘Ÿ๐‘–๐‘๐‘ก1,1 . ๐’ฐ๐’ฎ ๐‘ก2 are used for the consumers. We can do the same for Instead of training just one model and just one way of en- ๐‘1 , using ๐’ž๐’ฎ ๐‘ก1 and ๐’ฐ๐’ฎ ๐‘ก1 with ๐‘€1๐‘ก to compute ๐‘Ž๐‘๐‘๐‘ก1,1 and coding, we can also train multiple different model types and ๐‘Ž๐‘›๐‘๐‘๐‘ก1,1 respectively. In general, for loyalty campaigns different ways of encoding. We repeat the model training ๐‘๐‘– , ๐‘๐‘— and time ๐‘ก we have ๐‘Ž๐‘๐‘๐‘ก๐‘–,๐‘— = ๐œ‡๐‘’โˆˆ๐’ž๐’ฎ ๐‘ก๐‘— [๐‘€๐‘–๐‘ก (๐‘’)] and and threshold optimization for each combination of model ๐‘Ž๐‘›๐‘๐‘๐‘ก๐‘–,๐‘— = ๐œ‡๐‘’โˆˆ๐’ฐ ๐’ฎ ๐‘ก๐‘— [๐‘€๐‘–๐‘ก (๐‘’)]. Note that these values do not type and encoding type. Each combination then results in a necessarily say something about the model quality. On the value for the metric and we select the combination with the one hand, we expect ๐‘Ž๐‘๐‘๐‘ก1,2 to be high, since we know that highest value. This process is referred to as hyper-parameter it evaluates only positive consumers. On the other hand, optimization. We discuss the exact model types in Section 5, we cannot make an expectation for ๐‘Ž๐‘›๐‘๐‘๐‘ก1,2 , as it evaluates the different encoding types are beyond the scope of this pa- negative and possibly also positive consumers. One thing per. After selecting the best combination of model type and 4.1. Baseline thresholds For the baselines we consider values that at most use the training data, so we only require data from the first loyalty campaign. Without any alteration, a baseline of 0.5 is often used as standard. We further add the optimal threshold based on the training data, ๐‘œ๐‘กโ„Ž๐‘ก1,1 . 4.2. Heuristic thresholds For the heuristic thresholds we consider values that also make use of the second loyalty campaign, and which can be used without further knowledge. For this, we consider four linear formulas. The first, โˆ…๐‘Ž๐‘๐‘, multiplies the optimal thresholds from training, ๐‘œ๐‘กโ„Ž๐‘ก1,1 with the ratio between the average converted probabilities, ๐‘Ž๐‘๐‘๐‘ก1,2 /๐‘Ž๐‘๐‘๐‘ก1,1 . The idea behind this method is that if the average probability of the Figure 4: The original prediction task (white area with solid converted consumers has increased (or decreased), then the bounds) and how OPUS extends it (gray area with dashed bounds). model likely overestimates (underestimates) the probability The set of converted consumers ๐’ž๐’ฎ is now used to tell something of a consumer being positive. Similarly, we can also use about the model. the average non-converter probabilities for this through multiplying ๐‘œ๐‘กโ„Ž๐‘ก1,1 by ๐‘Ž๐‘›๐‘๐‘๐‘ก1,2 /๐‘Ž๐‘›๐‘๐‘๐‘ก1,2 . We refer to this as the โˆ…๐‘Ž๐‘›๐‘๐‘ method. Apart from the ratio, we can also take Table 3 the difference, adding ๐‘Ž๐‘๐‘๐‘ก1,2 โˆ’ ๐‘Ž๐‘๐‘๐‘ก1,1 to ๐‘œ๐‘กโ„Ž๐‘ก1,1 , which we All threshold methods evaluated in this paper refer to as โˆ†๐‘Ž๐‘๐‘. The โˆ†๐‘Ž๐‘›๐‘๐‘ method is defined in a similar Type Name Description way. Note that in principle any of these four may result in a value above 1, and the difference based heuristic method Regular No altered threshold (0.5) Base ๐‘œ๐‘กโ„Ž๐‘ก๐‘–,๐‘– Optimal threshold from training phase may result in a value below 0. This does not invalidate the threshold, but it means that all entities will be predicted to โˆ…๐‘Ž๐‘๐‘ ๐‘œ๐‘กโ„Ž๐‘ก๐‘–,๐‘– multiplied by the ratio of ๐‘Ž๐‘๐‘ be negative or positive, respectively. Heuristic โˆ…๐‘Ž๐‘›๐‘๐‘ ๐‘œ๐‘กโ„Ž๐‘ก๐‘–,๐‘– multiplied by the ratio of ๐‘Ž๐‘›๐‘๐‘ ฮ”๐‘Ž๐‘๐‘ ๐‘œ๐‘กโ„Ž๐‘ก๐‘–,๐‘– increased by the difference of ๐‘Ž๐‘๐‘ ฮ”๐‘Ž๐‘›๐‘๐‘ ๐‘œ๐‘กโ„Ž๐‘ก๐‘–,๐‘– increased by the difference of ๐‘Ž๐‘›๐‘๐‘ 4.3. Learned thresholds DT ๐‘‡ ๐‘ƒ ๐‘€๐‘–,๐‘— , with Decision Tree Regressor For the learned thresholds we need full information about TPM RF ๐‘‡ ๐‘ƒ ๐‘€๐‘–,๐‘— , with Random Forrest Regressor the second loyalty campaign, meaning we can only apply the learned thresholds to a third loyalty campaign. What the heuristic methods have in common is that they use one or to note is that for ๐‘Ž๐‘๐‘๐‘ก1,2 we only evaluate consumers that more of the ๐‘Ž๐‘๐‘, ๐‘Ž๐‘›๐‘๐‘ and ๐‘œ๐‘กโ„Ž values to make an estimation have already converted by time ๐‘ก in ๐‘2 . It might be that for about the target value (๐‘œ๐‘กโ„Ž๐‘ก1,2 ). For the final set of threshold ๐‘ก early in the loyalty campaign we are including only a very estimation methods, we train a regression model, which specific subset of eager consumers. Such consumers may we refer to as the threshold prediction model, or ๐‘‡ ๐‘ƒ ๐‘€ . If show different behaviour from consumers that converted needed, we refer to ๐‘€1๐‘ก as the consumer prediction model later in the loyalty campaign. As such, ๐‘Ž๐‘๐‘๐‘ก1,2 may not be an to distinguish between the two. ๐‘‡ ๐‘ƒ ๐‘€ takes as descriptive accurate representation of the average predicted probability space all known values at ๐‘ก: the (non-)converted probabili- of all positive consumers, ๐‘Ž๐‘๐‘๐‘ก1,2 ๐‘’ . This is not a problem ties ๐‘Ž๐‘๐‘๐‘ก1,1 , ๐‘Ž๐‘๐‘๐‘ก1,2 , ๐‘Ž๐‘›๐‘๐‘๐‘ก1,1 , ๐‘Ž๐‘›๐‘๐‘๐‘ก1,2 , the optimal training per se. First, the model is still trained on all consumers, it threshold ๐‘œ๐‘กโ„Ž๐‘ก1,1 , the training metric value ๐‘š๐‘’๐‘ก๐‘Ÿ๐‘–๐‘๐‘ก1,1 , the therefore also captures behaviour from the consumers that model specification (type and encoding) and the point in convert later. Second, we can compare ๐‘Ž๐‘๐‘๐‘ก1,2 to ๐‘Ž๐‘๐‘๐‘ก1,1 . As time ๐‘ก. This is represented by the red lines Figure 4. As such, both will be biased towards early redeemers if ๐‘ก is target space, we have the value of ๐‘œ๐‘กโ„Ž๐‘ก1,2 . We can sample small. Therefore, we expect to compare the same type of these values for different values of ๐‘ก and for each of the consumers. We can therefore use these values to compare different model specifications. Using two loyalty campaigns, ๐‘1 and ๐‘2 and adapt the prediction threshold. we create a training set for ๐‘‡ ๐‘ƒ ๐‘€ , with the target values As we cannot change ๐‘€1๐‘ก , the best we can do is changing coming from the second loyalty campaign and the descrip- the prediction threshold. The best value is the one which tive features from both loyalty campaigns. We denote this is computed based on the labelled sets of consumers of the ๐‘‡ ๐‘ƒ ๐‘€ as ๐‘‡ ๐‘ƒ ๐‘€1,2 , without the superscript ๐‘ก as multiple second loyalty campaign; ๐’ซ๐’ฎ 2 and ๐’ฉ ๐’ฎ 2 . In line with the values of ๐‘ก are used to train ๐‘‡ ๐‘ƒ ๐‘€1,2 . Note that we cannot previous indexing, this value is ๐‘œ๐‘กโ„Ž๐‘ก1,2 . We can only know use ๐‘‡ ๐‘ƒ ๐‘€1,2 to estimate a threshold to use during ๐‘2 , as ๐‘2 this value once ๐‘2 ends, so during the loyalty campaign we needs to be finished before ๐‘‡ ๐‘ƒ ๐‘€1,2 can be learned. OPUS need different ways of estimating it. Apart from the Baseline does not depend on the specific regressor used as a base methods, we differentiate between Heuristic thresholds and model for ๐‘‡ ๐‘ƒ ๐‘€ , in Section 5 we evaluate Decision Tree Learned thresholds. All methods are presented in Table 3. and Random Forest regressors. The idea behind the learned thresholds is that ๐‘‡ ๐‘ƒ ๐‘€1,2 learns how to adapt the prediction method (model and threshold) from one loyalty campaign to another. This means we could learn how the prediction method is best โ€˜transformedโ€™ from ๐‘1 to ๐‘2 , and then apply this learned the store, using aggregates based on [18] and descriptive transformation. Suppose that ๐‘1 rewards glassware, ๐‘2 re- labels based on [19] We finally adopt two strategies: with wards pans, and a third ๐‘3 rewards kitchen knives. If we and without adding the converted consumers in the test learn a transformation from glassware to pans, then we set to the training set. In either strategy, the prediction can assess 1) how well does it work on transforming the quality metrics are only computed on the non-converted prediction method for a different set of consumers from consumers in the test set. For each timestep we select the glassware to pans and 2) how well does that transformation hyper-parameter combination with the best performance apply to the transformation from pans to kitchen knives. In during the training phase using ๐‘œ๐‘กโ„Ž๐‘ก1,1 to report the results the evaluation in Section 5 we limit the evaluation to the for all threshold estimation methods. first due to the unavailability of additional data on a third For each combination of hyper-parameters we compute loyalty campaign. ๐‘Ž๐‘๐‘๐‘ก1,๐‘– , ๐‘Ž๐‘›๐‘๐‘๐‘ก1,๐‘– , and ๐‘œ๐‘กโ„Ž๐‘ก1,๐‘– for ๐‘– โˆˆ {1, 2} and each timestep ๐‘ก. On the one hand we can use this to compute all threshold 4.4. Existing Threshold Adaption techniques estimation methods discussed in Table 3. On the other hand we can use all combinations to train threshold prediction OPUS is not the first framework that optimizes thresholds. models. For each dataset ๐‘๐‘ฅ we compute two ๐‘‡ ๐‘ƒ ๐‘€ s, one In GHOST [14], the optimal threshold of any classifier is using a Decision Tree regressor, ๐ท๐‘‡๐‘ฅ , and one using a Ran- optimized on a validation set. The metric value for several dom Forest regressor ๐‘…๐น๐‘ฅ . These ๐‘‡ ๐‘ƒ ๐‘€ ๐‘  are then used in thresholds is computed and averaged over several stratified every other dataset ๐‘๐‘ฆ , ๐‘ฆ ฬธ= ๐‘ฅ, to estimate optimal thresh- subsets of a validation set. While this technique may prove olds. We also add two results for comparison. The first uses effective for finding an optimal threshold for the original the actual optimal threshold (๐‘œ๐‘กโ„Ž๐‘ก1,2 ) and the second uses training and validation data, there is no update in a later a prediction model learned on 75% of the data of loyalty stage, and such an update would require labelled data. As campaign ๐‘2 and tested on the remaining 25% (Retrain). such, GHOST is not suitable for the problem at hand. Simi- Both these methods violate the train-then-test principle but larly to GHOST, [15] proposes a method to find the optimal are added as a comparison to how close we can get. threshold in a training set. The authors prove that for such During the training phase, models are trained on a strati- a method only the probabilities of positive class are required fied 75% split of the consumers that visited the store during when optimizing for ๐น1 . While this technique potentially loyalty campaign ๐‘1 . The remaining 25% are used as test saves computation time, it is not universally applicable to set. During the inference phase, models are trained on all all metrics. In this paper we optimize the thresholds by consumers that visited the store during loyalty campaign evaluating all predicted probabilities, and selecting the one ๐‘2 . The consumers that visited the store by time ๐‘ก are used that optimizes the target metric. Finally, [16] proposes a as test set. The entire test set is used to compute ๐‘œ๐‘กโ„Ž๐‘ก1,๐‘– technique to adapt the separating hyperplane position of using the model trained on the training set. The converted SVMs to improve its predictive quality. OPUS is different (non-converted) consumers by time ๐‘ก in the test set are used from these because it tries to optimize the threshold using to compute ๐‘Ž๐‘๐‘๐‘ก1,๐‘– (๐‘Ž๐‘›๐‘๐‘๐‘ก1,๐‘– ). The non-converted consumers the new, partly labelled data in an online fashion. OPUS is by time ๐‘ก are used to compute the model performance for a designed to improve the predictions under concept drift. given threshold estimation method. Next to these methods, we also apply PU-learning defined by [5]. For this we use the same hyper-parameters as for 5. Experimental Evaluation the other experiments. We first find the best combination For the experimental evaluation1 we use data from a real- based on loyalty campaign ๐‘1 and then report the score for world retailer containing a total of about 160 000 consumers that hyper-parameter combination in ๐‘2 , computed over all shopping in exactly one of two consecutive loyalty cam- non-converted consumers at the respective point in time. paigns, ๐‘1 and ๐‘2 . We partition the consumers in ten sep- We apply the above for both the ๐น1 and Accuracy metrics. arate datasets ๐‘1 through ๐‘10 . For each dataset we build We present two types of results: those for timestep ๐‘ก = 4, several prediction models based on a grid-search over hyper- separated per dataset, and those for all timesteps, aggregated parameters and at different timesteps relative to the start of over the datasets. The reason for separately reporting ๐‘ก = 4 the loyalty campaign. comes from domain knowledge, this is the point in the Models are trained and tested at multiple points in time loyalty campaign where additional rewards for the loyalty relative to the loyalty campaign, after 2, 4, 6, . . . , 18 weeks campaign can still be ordered by the supermarket. into the campaign, ending before ๐‘ก๐‘’ . The encoding of con- sumers is based on the data available from the pre loyalty 5.1. Results for ๐‘ก = 4 campaign period and the data during the loyalty campaign Results are presented per split (one per row) and threshold up to ๐‘ก. We consider three hyper-parameters: the base estimation method (one per column) in Figure 5. We do model, the encoding technique, and a strategy. The base not report the values of the combinations where a ๐‘‡ ๐‘ƒ ๐‘€ is models are the consumer prediction models, for which we learned from the same split as the data it is tested on (the use ๐‘˜-nearest neighbours (๐‘˜NN), decision tree classifiers white diagonals). (DTs), support vector classifiers (SVCs) and Adaptive Hoeffd- Baseline methods For the baseline methods we see that ing Option Trees (AHOTs), all with their default parameters using the regular value is not much better than always pre- in Scikit-Learn for ๐‘˜NN, DTs, SVCs, and Scikit-Multiflow dicting True or False, almost all scores being 0.50, which is [17] for AHOTs, except for setting a constant seeding and effectively the worst value one can get for with balanced ac- training the SVCs on probabilities instead of labels. The en- curacy in a binary classification. Using the optimal training coding techniques define what type of features we use about threshold ๐‘œ๐‘กโ„Ž๐‘ก๐‘–,๐‘– works much better, with the exception of a consumer. The features are based on individual visits to ๐‘5 , ๐‘6 and ๐‘7 . 1 For the implementation, see github.com/YorickSpenrath/opus BL Heuristic TPM c0 .50 .82 .81 .82 .50 .82 .82 .83 .83 .83 .86 .83 .83 .82 .82 .82 .83 .86 .83 .83 .83 .83 .82 .71 .86 .86 .68 c1 .50 .80 .80 .80 .66 .80 .68 .80 .83 .83 .82 .82 .80 .80 .80 .82 .80 .83 .82 .80 .82 .80 .80 .80 .83 .85 .50 c2 .50 .82 .82 .82 .50 .82 .82 .82 .85 .82 .85 .85 .82 .82 .82 .82 .82 .82 .85 .82 .82 .82 .82 .82 .85 .87 c3 .55 .77 .78 .78 .84 .77 .84 .84 .84 .77 .77 .84 .78 .81 .84 .84 .84 .84 .81 .81 .81 .78 .78 .77 .84 .87 .82 c4 .50 .82 .82 .82 .72 .82 .52 .84 .82 .84 .84 .52 .82 .84 .82 .84 .84 .82 .84 .84 .84 .82 .84 .82 .85 .86 .51 c5 .50 .68 .81 .81 .81 .68 .81 .83 .85 .83 .81 .83 .68 .83 .83 .85 .85 .85 .85 .85 .85 .81 .83 .81 .85 .82 c6 .50 .68 .68 .81 .68 .68 .86 .86 .86 .81 .86 .68 .81 .68 .86 .81 .81 .86 .81 .86 .81 .81 .68 .81 .87 .88 .65 c7 .50 .66 .66 .66 .66 .66 .66 .84 .81 .84 .84 .84 .84 .84 .66 .84 .84 .84 .84 .84 .84 .84 .84 .81 .84 .87 c8 .50 .86 .86 .86 .50 .50 .86 .86 .86 .86 .77 .77 .86 .77 .77 .86 .86 .86 .86 .86 .86 .86 .86 .77 .86 .86 c9 .50 .80 .80 .80 .80 .80 .80 .80 .80 .80 .80 .80 .80 .80 .80 .80 .80 .80 .80 .80 .80 .80 .80 .80 .80 .85 .78 Regular Retrain acp ancp ancp othi,t i othi,t j RF0 RF1 RF2 RF3 RF4 RF5 RF6 RF7 RF8 RF9 acp PU DT0 DT1 DT2 DT3 DT4 DT5 DT6 DT7 DT8 DT9 Figure 5: Accuracy at timestep 4. Values shown in bold are strictly better than the ๐‘œ๐‘กโ„Ž๐‘ก๐‘–,๐‘– , the ones in italic are strictly smaller. Higher values correspond to a brighter colour. The white diagonals are skipped since those the ๐‘‡ ๐‘ƒ ๐‘€ would be trained on the same split we test it on. Some scores for PU-learning are left out; for these splits there were not enough converted consumers to train a meaningful prediction model. Heuristic methods For the heuristic methods we have describes the benefit of using the learned method in OPUS that, except for (๐‘8 , โˆ†๐‘Ž๐‘›๐‘๐‘), the ๐‘Ž๐‘›๐‘๐‘ methods consistently in general. We see what one can on average expect using performs as well or better than the baseline ๐‘œ๐‘กโ„Ž๐‘ก๐‘–,๐‘– . The ๐‘Ž๐‘๐‘ any other dataset to train the ๐‘‡ ๐‘ƒ ๐‘€ to estimate the opti- values do much worse however, especially for the โˆ†๐‘Ž๐‘๐‘. mal threshold is beneficial. What is more, mainly for the As explained in Section 4.2, if the difference between ๐‘Ž๐‘๐‘๐‘ก๐‘–,๐‘– Random Forest ๐‘‡ ๐‘ƒ ๐‘€ , we see that we get values that are and ๐‘Ž๐‘๐‘๐‘ก๐‘–,๐‘— is too high, this method might end up predicting close to using the actual optimal threshold ๐‘œ๐‘กโ„Ž๐‘ก๐‘–,๐‘— . For the all shoppers as redeemers or all shoppers as non-redeemers. heuristic thresholds, we see that the ratio-based ones (โˆ…) This is what happens for splits 0, 2 and 8. are performing close to or better than the ๐‘œ๐‘กโ„Ž๐‘ก๐‘–,๐‘– baseline. Learned methods With only a few exceptions, the For PU-learning we see that the results improve over time learned methods always perform at least as good as the as more positive labels are known. This is to be expected, baseline and often beat it. Furthermore, for most splits the though it does not yet catch with OPUS methods. best values even come close to the results for ๐‘œ๐‘กโ„Ž๐‘–,๐‘— , the F1 We see something similar to the accuracy results. This value which we aim to estimate. Next to this, we see that is expected, as we are penalizing in a similar way by com- the Random Forest ๐‘‡ ๐‘ƒ ๐‘€ performs almost always as well paring the actual with the predicted labels for each shopper. as or better than the Decision Tree based ๐‘‡ ๐‘ƒ ๐‘€ , compared There are two important differences. The first is that we on the same combination of splits. This is to be expected see lower values. This is because ๐น1 does not account for given the complexity of the Random Forest model and that class imbalance as we did for the balanced accuracy score. it can as such capture more difficult relations. Given that the redeemer class is much smaller than the non- PU-Learning One weakness of PU-learning is if the num- redeemer class, the reported values for ๐‘œ๐‘กโ„Ž๐‘ก๐‘–,๐‘— and Retrain ber of known positive labels is small. This results in a base are reasonable. This is also seen by the non ๐‘œ๐‘กโ„Ž๐‘ก๐‘–,๐‘– baselines, prediction model that cannot predict positive points, result- which have a much lower score. The second is that we are ing in an average probability of 0 for the known positive further off from the optimal thresholds ๐‘œ๐‘กโ„Ž๐‘ก๐‘–,๐‘— . This is likely labels. In such a scenario, PU-learning fails to produce any because of a similar reason: the ๐น1 is much more penalizing result. This is what happens in 4 of the splits. Note that for imbalanced datasets, and as such, being able to use the OPUS does not have this problem, as we are using the la- optimal thresholds/all labels allows for a much better score belled data from last loyalty campaign. In the splits where than estimating it using any of the methods from OPUS. there are enough known positive labels, it is still outper- Like accuracy, the results of PU-learning improve over time, formed by OPUS, for a similar reason. eventually even improving over OPUS methods. The lower accuracy and higher ๐น1 for PU-learning is likely caused by 5.2. Results aggregated per timestep the unbalanced dataset as well. We next aggregate the results over the splits for each timestep in Figures 6a and 6b. We report the mean and 6. Conclusion 95% confidence interval of the metric for every timestep (each row). For most estimation methods (each column) this In this paper we have presented OPUS, a framework to es- is the average over the 10 values from different splits, for the timate a better prediction threshold when only one class learned estimation methods we consider all combinations. of the target labels is available. The advantage of such In other words, the reported aggregated values for ๐ท๐‘‡ and optimizations is that we can adapt existing prediction mod- ๐‘…๐น are taken over all 90 combinations. els without having to rely on fully labelled data. This is a Accuracy Some results from Section 5.1 can be trans- more realistic assumption when the true label of one class ferred to more timesteps. The learned thresholds seem to is deferred. OPUS makes use of two types of threshold esti- perform better than using ๐‘œ๐‘กโ„Ž๐‘ก๐‘–,๐‘– . Note that, as we have mation methods, where learned methods are more effective averaged over both the splits used for the evaluation as in getting to an optimal prediction threshold at the cost of well as the splits used for computing the ๐‘‡ ๐‘ƒ ๐‘€ , this result requiring more data (only being available at a third cam- Baseline Heuristic TPM Comparison 4 .50 ยฑ.01 .77 ยฑ.04 .79 ยฑ.04 .80 ยฑ.03 .67 ยฑ.08 .74 ยฑ.06 .81 ยฑ.01 .82 ยฑ.01 .85 ยฑ.01 .86 ยฑ.01 .66 ยฑ.10 8 .53 ยฑ.02 .80 ยฑ.03 .80 ยฑ.03 .80 ยฑ.04 .65 ยฑ.10 .78 ยฑ.06 .81 ยฑ.01 .82 ยฑ.01 .84 ยฑ.02 .88 ยฑ.01 .73 ยฑ.03 12 .53 ยฑ.03 .78 ยฑ.04 .76 ยฑ.05 .76 ยฑ.05 .68 ยฑ.08 .79 ยฑ.04 .78 ยฑ.02 .80 ยฑ.02 .83 ยฑ.03 .86 ยฑ.01 .74 ยฑ.03 16 .60 ยฑ.04 .79 ยฑ.06 .77 ยฑ.07 .77 ยฑ.05 .72 ยฑ.09 .76 ยฑ.07 .80 ยฑ.01 .81 ยฑ.01 .83 ยฑ.03 .89 ยฑ.01 .73 ยฑ.04 (a) Accuracy 4 .03 ยฑ.03 .14 ยฑ.09 .17 ยฑ.09 .20 ยฑ.09 .21 ยฑ.09 .12 ยฑ.09 .25 ยฑ.03 .29 ยฑ.02 .34 ยฑ.03 .40 ยฑ.02 .15 ยฑ.08 8 .09 ยฑ.05 .19 ยฑ.08 .22 ยฑ.09 .20 ยฑ.08 .12 ยฑ.06 .20 ยฑ.08 .20 ยฑ.03 .22 ยฑ.03 .33 ยฑ.04 .39 ยฑ.03 .34 ยฑ.03 12 .06 ยฑ.05 .21 ยฑ.06 .18 ยฑ.07 .19 ยฑ.07 .15 ยฑ.06 .19 ยฑ.07 .20 ยฑ.02 .20 ยฑ.02 .25 ยฑ.05 .35 ยฑ.02 .32 ยฑ.01 16 .07 ยฑ.04 .11 ยฑ.05 .10 ยฑ.05 .09 ยฑ.05 .11 ยฑ.05 .11 ยฑ.05 .15 ยฑ.01 .17 ยฑ.01 .20 ยฑ.02 .31 ยฑ.03 .25 ยฑ.01 Regular DT RF Retrain acp ancp ancp othi,t i othi,t j acp PU (b) ๐น1 Figure 6: a) Accuracy and b) ๐น1 at each timestep, averaged over the splits with their 95% confidence interval. Values shown in bold are strictly better than the ๐‘œ๐‘กโ„Ž๐‘ก๐‘–,๐‘– , the ones in italic are strictly worse (in terms of the reported mean). Better values correspond to a brighter colour. paign), compared to heuristic methods that do not have robust ensemble approach to learn from positive and these benefits or limitations. While the initial results are unlabeled data using SVM base models, Neurocom- promising, we want to elaborate the experimental analysis, puting 160 (2015) 73โ€“84. both by including more than two loyalty campaigns and [9] J. Bekker, J. Davis, Learning from positive and unla- by applying OPUS on a different use case inside the student beled data: a survey, Machine Learning 109 (2020) learning domain. In the design of OPUS only characteristics 719โ€“760. that belong to the current timestep are considered, not of [10] P. De Abreu Lopes, H. De Arruda Camargo, Fuz- all previous timesteps. We argue that the progression of zStream: Fuzzy data stream clustering based on the these characteristics, as well as the model performance over online-offline framework, IEEE International Confer- time, and how they compare to the same period in the pre- ence on Fuzzy Systems (2017). vious loyalty campaign can be interesting to improve the [11] T. P. da Silva, V. M. A. Souza, G. E. A. P. A. Batista, framework in future work. H. de Arruda Camargo, A fuzzy classifier for data streams with infinitely delayed labels, LNCS 11401 (2019) 287โ€“295. References [12] K. B. Dyer, R. Capo, R. Polikar, Compose: A semisuper- vised learning framework for initially labeled nonsta- [1] W. Rizzi, C. Di Francescomarino, C. Ghidini, F. M. tionary streaming data, IEEE Transactions on Neural Maggi, How do I update my model? On the resilience Networks and Learning Systems 25 (2014) 12โ€“26. of Predictive Process Monitoring models to change, [13] B. Settles, Active Learning Literature Survey, Techni- KAIS 64 (2022) 1385โ€“1416. cal Report, University of Wisconsin-Madison, 2009. [2] C. Fahy, S. Yang, M. Gongora, Classification in Dy- [14] C. Esposito, G. A. Landrum, N. Schneider, N. Stiefl, namic Data Streams with a Scarcity of Labels, IEEE S. Riniker, GHOST: Adjusting the Decision Threshold TKDE (2021). to Handle Imbalanced Data in Machine Learning, Jour- [3] V. M. A. Souza, D. F. Silva, J. Gama, G. E. A. P. A. Batista, nal of Chemical Information and Modeling 61 (2021) Data Stream Classification Guided by Clustering on 2623โ€“2640. Nonstationary Environments and Extreme Verification [15] Q. Zou, S. Xie, Z. Lin, M. Wu, Y. Ju, Finding the Best Latency, in: SDM, 2015, pp. 873โ€“881. Classification Threshold in Imbalanced Classification, [4] N. J. Bombaij, S. Gelper, M. G. Dekimpe, Designing suc- Big Data Research 5 (2016) 2โ€“8. cessful temporary loyalty programs: An exploratory [16] H. Yu, C. Mu, C. Sun, W. Yang, X. Yang, X. Zuo, Support study on retailer and country differences, International vector machine-based optimized decision threshold Journal of Research in Marketing 39 (2022) 1275โ€“1295. adjustment strategy for classifying imbalanced data, [5] C. Elkan, K. Noto, Learning classifiers from only posi- Knowledge-Based Systems 76 (2015) 67โ€“78. tive and unlabeled data, KDD (2008) 213โ€“220. [17] J. Montiel, J. Read, A. Bifet, T. Abdessalem, Scikit- [6] J. Peeperkorn, C. O. Vรกzquez, A. Stevens, J. D. Smedt, Multiflow: A Multi-output Streaming Framework, S. Broucke, J. D. Weerdt, Outcome-Oriented Predictive JMLR 19 (2018) 1โ€“5. Process Monitoring on Positive and Unlabelled Event [18] Y. Spenrath, M. Hassani, B. F. van Dongen, Online Logs, ICPM workshops (2022). prediction of aggregated retailer consumer behaviour, [7] J. Bekker, P. Robberechts, J. Davis, Beyond the Selected ICPM Workshops (2021). Completely at Random Assumption for Learning from [19] Y. Spenrath, M. Hassani, B. v. Dongen, H. Tariq, Learn- Positive and Unlabeled Data, LNCS 11907 LNAI (2020) ing an efficient distance metric for retailer transaction 71โ€“85. data, in: PKDD, 2020, pp. 323โ€“338. [8] M. Claesen, F. De Smet, J. A. Suykens, B. De Moor, A