<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Online Prediction Threshold Optimization Under Semi-deferred Labelling</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yorick Spenrath</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marwan Hassani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Boudewijn F. van Dongen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Process Analytics Group, Faculty of Mathematics and Computer Science, Eindhoven University of Technology</institution>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
<p>In supermarket loyalty campaigns, shoppers collect stamps to redeem limited-time luxury products. Having an accurate prediction of which shoppers will eventually redeem is crucial to effective execution. With the ultimate goal of changing shopper behavior, it is important to ensure an adequate number of rewards and to be able to steer promising shoppers into joining the campaign and redeeming a reward. If information from previous campaigns is available, a prediction model can be built to predict the redemption probability, possibly also adapting the prediction threshold that determines the predicted label. During a running campaign, we only know a subset of the labels of the positive class (the so-far redeemers), and have no access to the labels of any example of the negative class (non-redeemers at the end of the campaign). The majority of the examples during the campaign do not have a label yet (shoppers that could still redeem but have not done so yet). This is a semi-deferred labelling setting, and our goal is to improve the prediction quality using this limited information. Existing work on predicting (semi-deferred) labels either focuses on positive-unlabelled learning, which does not use existing models, or updates models after the prediction is made by assigning expected labels using unsupervised learning models. In this paper we present a framework for Online Prediction threshold optimization Under Semi-deferred labelling (OPUS). Our framework does not change the existing model, but instead adapts the prediction threshold that decides which probability is required for a positive label, based on the semi-deferred labels we already know. We apply OPUS to real-world data from a supermarket with two campaigns and over 160 000 shoppers.</p>
      </abstract>
      <kwd-group>
        <kwd>Prediction threshold</kwd>
        <kwd>Online</kwd>
        <kwd>Semi-deferred labels</kwd>
        <kwd>Supermarket loyalty campaign</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Traditional supervised machine learning projects start with
a set of labelled data. Specifically, in binary classification,
data belongs to one of two classes: positive and negative.
Starting from a set of examples with their ground-truth labels,
a predictor is created, which can assess the probability that
an unseen example belongs to the positive class. Without
optimization, an example is predicted to be part of the
positive class if this probability is at least 0.5. One improvement
is to select a different prediction threshold that separates
examples into the positive and negative class based on their
probability. In the presence of labelled data, such a threshold
can be picked in a way that the predicted labels maximize
a given metric. Provided that no concept drift occurs, the
model with the adapted threshold can be used indefinitely
without degrading prediction quality. In this scenario, we
only use the ‘offline training’ (I) of Figure 1.</p>
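      <p>Selecting a metric-maximizing threshold on labelled data can be illustrated with a small grid search; a dependency-light sketch (the function names and the choice of F1 as the metric are ours, not from the paper):</p>

```python
import numpy as np

def f1(y_true, y_pred):
    """Plain F1 score, written out to keep the sketch dependency-free."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_threshold(y_true, y_prob, grid=np.linspace(0, 1, 101)):
    """Pick the threshold on labelled validation data that maximizes the metric."""
    scores = [f1(y_true, (y_prob > h).astype(int)) for h in grid]
    return grid[int(np.argmax(scores))]

# Toy validation set: the default threshold of 0.5 misses the positive
# example scored at 0.40, so a lower threshold scores better.
y_true = np.array([0, 0, 0, 1, 1, 1])
y_prob = np.array([0.10, 0.20, 0.45, 0.40, 0.60, 0.90])
h_opt = best_threshold(y_true, y_prob)
```

      <p>Strict inequality (probability &gt; threshold) is used here to match the classification rule defined later in Section 3.</p>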
      <p>
        This assumption on the data is however not realistic since
changes in the data distribution such as concept drift can
reduce the quality of the model. A common technique is to
retrain or update the model at a later point in time, when
more recent labelled data is available [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], potentially
also updating the prediction threshold. This is referred
to as online training: new labelled examples are received
and can be used to update the model and/or
prediction threshold. Such techniques assume that the label
latency, that is the time between seeing an example and its
true label, is small. Under these conditions prediction
systems can react to concept drift as it occurs. This assumption
may not be valid in all real-world scenarios, for example
because it is computationally or labour intensive to do so [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>[Figure 1: timeline of the labelling setting, from offline training at time 0, through the phases with only positive labels and more positive labels, to all labels at the campaign end t<sub>e</sub>.]</p>
      <p>
        If the latency becomes too high, models can no longer
be properly corrected for concept drift, or cannot be updated
at all if the latency is infinite [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>In this paper we consider a special class of the latter,
where we know for some examples that they are positive
and do not know the label of the other examples (they may
be either positive or negative). This is sketched in Figure 1.
At time t, we have the (potentially outdated) offline training
data. We have further seen new examples, of which we
know some are positive. It is however not until time t<sub>e</sub> that
we know the actual labels of all examples; up to that
point we only receive the actual label of positive examples.
We refer to this as semi-deferred labelling, as we have some
labelled examples, but only from one class. This still does
not allow us to update or retrain the model, nor does
it allow us to set a new threshold that optimizes our target
metric. Setting this new threshold is important, since we
still want to make a prediction for all unlabelled examples.
We therefore introduce OPUS, Online Prediction threshold
optimization Under Semi-deferred labelling, which aims to
set a better threshold for an existing prediction model, based
on the limited available positive examples. Note that this
is not the same as imbalanced training, as we have only a
single class of up-to-date data and do not make assumptions
about the possibly outdated data.</p>
      <p>
        Even though OPUS can be generalized to other scenarios,
we discuss a supermarket scenario in this paper. Shoppers in this
supermarket make use of loyalty cards which identify them
at each purchase they make. Once a year, the supermarket
holds a so-called loyalty campaign [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. During the campaign,
shoppers may collect stamps. These stamps can be collected
by spend (for example one stamp for every 10 monetary
units) or in special promotions (an extra stamp for a certain
product). If a shopper has collected enough stamps, they can
purchase a reward. Rewards are usually luxury products
available at a much lower (sometimes only symbolic) price.
Shoppers are thus persuaded to participate in the loyalty
campaign. In this work we consider campaigns with a
limited time scope: shoppers can only collect the stamps and
redeem rewards within the duration of the campaign. For
such a campaign to work, the supermarket must accurately
predict how many rewards are required, such that each
participating shopper can actually get their reward. Having
too few rewards means disappointing shoppers; having too
many means there will be unusable stock left. It is beneficial
to make predictions about what shoppers will do during
the rest of the campaign, in an effort to either steer their
decisions or adapt the available rewards later on.
      </p>
      <p>For our prediction model, we are interested in whether
a shopper will make a redemption: shoppers have positive
labels if they redeem a reward during the campaign, and
negative labels otherwise. Consider two campaigns, C<sub>1</sub> and
C<sub>2</sub>. During C<sub>2</sub> we want to predict whether a shopper will
redeem, based on a classifier we learned from the finished
campaign C<sub>1</sub>. We continuously receive new data and make
new predictions at the end of every week. In practice this
means we get new positive examples in batches; these are all
consumers that first redeemed in the preceding week. Let t
be the moment of prediction, relative to the start of C<sub>2</sub>. We train
a classifier and select the best threshold. To do so, we split
the data of campaign C<sub>1</sub> into a train and validation set,
train several models on the former, and adjust the threshold
for each model to maximize a target metric using the latter.
One of these models has the highest metric value and we
select that as our prediction model. We retrain that model
on all shoppers of campaign C<sub>1</sub>, and test it on campaign
C<sub>2</sub>, using the corresponding learned optimal threshold. The
problem with this approach is that the model trained on
campaign C<sub>1</sub> might not work as well on campaign C<sub>2</sub>. We
also do not get new negative labels during C<sub>2</sub>, which means
that updating our prediction model or creating a new one is
not feasible. The best we can do is to adapt the prediction
threshold based on the limited labels we have available for
C<sub>2</sub>. This is the main contribution of this paper: OPUS is a
framework that adapts this threshold, based on the new
labelled data.</p>
      <p>The remainder of this paper is structured as follows. In
Section 2 we discuss existing solutions to the semi-deferred
prediction problem. In Section 3 we formalize the prediction
problem for loyalty campaign participation. In Section 4
we then discuss OPUS. We finally evaluate OPUS against
alternative methods in Section 5 and conclude the paper in
Section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>For this problem there are roughly three existing categories
of solutions. These are schematically presented in Figure 2.
The first is retraining the classifier with an extended training
set, where the new positive examples are added. This can
either have limited effect if the number of positive labels is
small, or it can heavily bias the prediction model towards the
positive class if the number of positive labels is high. The
second category builds a model based on the positive and
unlabelled state of the examples. The third is to ignore the
positive labels, and assume all new data is unlabelled. All
these techniques make a new prediction model or update
the existing one. We discuss the second and third categories
below, including their specific disadvantages in the
semi-deferred labelling problem.</p>
      <sec id="sec-2-1">
        <title>2.1. PU-learning</title>
        <p>
          The original PU-learning solution was proposed in [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. Here
a classifier is built using positive and unlabelled examples,
where the class of the unlabelled examples is not known.
Semi-deferred labelling is less restrictive in two ways. The
first is that we do have a, potentially outdated, set of
labelled data. For the supermarkets, this is a previous loyalty
campaign. The second is that we do not have a single set
of positive examples but rather receive additional positive
examples later on. Because semi-deferred labelling is less
restrictive, PU-learning can still be applied to our problem
at different points in time, though it might not be as effective
because it ignores part of the data.
        </p>
        <p>
          The reasoning in [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] is as follows.
Given a set of examples, of which only a fraction is labelled,
and only from a single class (positive), the authors reason
that the conditional probability that an example belongs
to the positive class is equal to the conditional probability
that the example is labelled, divided by the probability a
positive sample is labelled. From this reasoning, one can
construct a classifier on whether a sample is labelled and
use that to predict the probability that positive samples are
labelled (as average of the classifier outcome for the labelled
examples). For the prediction of the unlabelled examples,
the probability given by the classifier is divided by this latter
value. An extension of this is discussed in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], where some
negative effects of misclassification are discussed.
PU-learning usually assumes that the examples are labelled
completely at random. In [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] this assumption is abandoned:
the probability of an example being labelled depends
on its attributes. This is closer to our specific scenario, as
consumers that are labelled as positive earlier in the loyalty
campaign may have different characteristics than consumers
labelled later in the loyalty campaign. In [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], the assumption
that all labels are accurate is dropped: the authors propose a
robust method of an ensemble of Support Vector Machines
(SVMs) that can deal with the impurity of positive examples.
For loyalty campaign participation prediction, this is not
applicable: we have an exact definition of positive labels. In
addition to these works, a more general overview of
PU-learning is given in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
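        <p>The estimator described above can be sketched in a few lines; a hedged illustration (variable names are ours), assuming a classifier g was already trained to predict whether an example is labelled:</p>

```python
import numpy as np

def pu_correct(prob_labelled, labelled_mask):
    """Elkan-Noto style correction: divide the 'is labelled' probability by
    c, the average classifier output over the known (labelled) positives."""
    c = prob_labelled[labelled_mask].mean()
    return np.clip(prob_labelled / c, 0.0, 1.0)

# Suppose g(x) gives these scores, and the first two examples are the
# labelled positives (hypothetical numbers for illustration only).
g = np.array([0.80, 0.60, 0.35, 0.07])
labelled = np.array([True, True, False, False])
p_pos = pu_correct(g, labelled)
```

        <p>Here c estimates the probability that a positive example is labelled; dividing by it and clipping to [0, 1] converts the "labelled" probability into a "positive" probability for the unlabelled examples.</p>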
        <p>What all these techniques have in common is that existing
prediction models are not used or changed. While this does
not invalidate their use in semi-deferred labelling, their more
restrictive assumptions do not allow them to benefit from
the existing data.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Ignoring new labels</title>
        <p>
          A more restrictive assumption on the labels is that no labels
are available, short of an initial training set. This would
be comparable to ignoring newly converted consumers in
the second loyalty campaign. Research dealing with such
scenarios is often done in a streaming setting, either by
predicting labels based on clustering techniques, or using
semi-supervised learning methods. In [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], the authors propose a
framework called SCARG. New examples are first classified
by an existing classifier. After enough new examples are
seen, new clusters are formed. The clusters are matched
to existing clusters by their medoids and as such given a
(new) label. Points in the clusters are then used to create a
new prediction model. A similar approach using fuzzy
clustering [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] is made in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. Instead of making predictions
using an existing classifier, the COMPOSE framework
discussed in [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] uses a semi-supervised approach. Assuming
an initial set of labels, new unlabelled examples are
classified in a semi-supervised manner. Next, only a selection
of these examples is kept for the next timestep, at which
new unlabelled examples are again classified using the
semi-supervised learner. The main contribution of COMPOSE
is the way in which the kept examples are selected: using
α-shapes, which are compacted until a desired number of
examples remain.
        </p>
        <p>The commonality in these approaches is that they focus
on a streaming setting. At specific points in time, one or
more existing models are updated using estimated labels
from the respective solution. Although the authors show the
effectiveness in dealing with streaming data by seeing many
new examples at different timesteps, there are three key
differences that make these methods unsuitable for our setting.
First, we do not have a constant stream of new examples;
instead we see new information about existing examples at
discrete points in time. Second, we are only interested in
improving a single model once, as we evaluate the existing
model several weeks into the loyalty campaign. Third, we
do not expect the drift to be gradual; we are starting a whole
new loyalty campaign a year later, so our expected type of
drift is more of the abrupt kind. If we were to apply SCARG
or fuzzy clustering to predict loyalty campaign
participation, it would effectively be a more elaborate kNN classifier.
COMPOSE is not applicable at all, since there is no way of
knowing the true label of non-participating consumers until
the end of the loyalty campaign.</p>
        <p>
          Another solution to the unavailability of labels is
active learning, such as in [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. Instead of having no labels,
learners can query selected labels from domain experts. An
example of dealing with such conditions in a multiclass
setting is COCEL [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. An ensemble of one-class classifiers is
kept, each predicting only whether an example belongs to
its respective class. If none of the classifiers recognizes the
new example as one of their classes, the example is added to a
buffer. Clusters in this buffer are labelled by domain experts.
Similar to COMPOSE, such solutions are not applicable in
our scenario, as we cannot know the actual label until the
end of the loyalty campaign.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Prediction Task</title>
      <p>Before we explain how OPUS works, we first discuss the
prediction task itself. All symbols used are summarized in
Table 1. As introduced in Section 1, we want to solve the
problem: “Given the data related to a consumer at time t
during the loyalty campaign, can we predict whether they
will participate in the loyalty campaign?”. This is a problem
of binary classification: for each consumer we want to
determine whether they are a member of one of two classes: the
positive class (participant) or negative class (non-participant).
One way to do this is to use a classifier with a prediction
threshold. A classifier is a function that assigns a probability
to the encoding of a consumer. If the probability exceeds the
prediction threshold, the consumer is predicted to belong to
the positive class; otherwise they are predicted to belong to
the negative class. While classifiers can usually be trained to
make a prediction between more than two classes, we only
consider binary classifiers in this paper which, without loss
of generality, predict the probability that a consumer is
part of the positive class.</p>
      <p>
        Let f : C → [0, 1] be a function, with C the set of
all consumers. f is a binary classifier that estimates the
probability that a consumer c belongs to the positive class.
For a given prediction threshold h ∈ [0, 1], consumer c
is predicted to belong to the positive class if and only if
f(c) &gt; h. The binary classifier and the prediction
threshold together assign a predicted binary class; this is the binary
classification of the consumer. In this paper, the focus is
on determining the threshold. However, before we do so,
we explain what data is used to train the classifier for the
prediction task. This is presented schematically in Figure 3.
      </p>
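      <p>The classifier-plus-threshold rule above amounts to a one-line predicate; a trivial sketch (names ours), with the strict inequality meaning a probability exactly equal to h yields a negative prediction:</p>

```python
def classify(prob: float, h: float) -> int:
    """Predicted class: positive (1) iff the estimated probability
    strictly exceeds the prediction threshold h."""
    return 1 if prob > h else 0

# With h = 0.5, a probability of exactly 0.5 is still predicted negative.
labels = [classify(p, 0.5) for p in (0.30, 0.50, 0.85)]
```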
      <sec id="sec-3-1">
        <title>3.1. Training phase: the first loyalty campaign</title>
        <p>We make a prediction at time t relative to the start of the
loyalty campaign, which ends at t<sub>e</sub>. We start by learning
a classifier from a previous loyalty campaign. For C<sub>1</sub> we
know which class each consumer that visited the store
during C<sub>1</sub> belongs to: whether they redeemed a reward
(P<sub>1</sub>) or not (N<sub>1</sub>). The 1 in the subscript indicates that
these consumer sets belong to loyalty campaign C<sub>1</sub>. We
select a stratified sample of 75% from this set to train a model
and later use the remaining 25% to optimize the threshold.
Many different types of classifiers exist, but in general the
training of a classifier involves finding a function that
maximizes the predicted probability for consumers that belong
to the positive class and minimizes it for consumers that
belong to the negative class.</p>
        <p>The classifier that is trained on data up to time t is
referred to as f<sup>t</sup><sub>1</sub>. The reason is that we want to use it at time
t in C<sub>2</sub>, so only data up to that time in C<sub>1</sub> should be used
to train the classifier, to make the encoding of consumers
as similar as possible. This is indicated by the superscript t.
After training a classifier, we select a prediction threshold.
If we set the threshold lower, more consumers will be
predicted as participants; if we set the threshold higher, fewer
consumers will be predicted as participants. For a given
binary classification metric and a set of consumers, there is
a threshold that can be considered the best, as it results in
the best metric score. We call this the optimal threshold, or
h<sup>t</sup><sub>1,1</sub>. The consumers used to optimize the threshold come
from the 25% that was not used in training. The double 1
subscript indicates that it is the optimal threshold for the
model trained on C<sub>1</sub> and optimized using data from C<sub>1</sub>. The
value of the metric for this optimal threshold is denoted by
m<sup>t</sup><sub>1,1</sub>.</p>
        <p>Instead of training just one model and just one way of
encoding, we can also train multiple different model types and
different ways of encoding. We repeat the model training
and threshold optimization for each combination of model
type and encoding type. Each combination then results in a
value for the metric, and we select the combination with the
highest value. This process is referred to as hyper-parameter
optimization. We discuss the exact model types in Section 5;
the different encoding types are beyond the scope of this
paper. After selecting the best combination of model type and
encoding type, we train a final model, which now includes
all consumers from C<sub>1</sub>, and keep the previously computed
threshold of that combination. With this new model and the
optimal threshold, we can make predictions in the following
loyalty campaign, C<sub>2</sub>.</p>
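        <p>The training-phase procedure of this section can be sketched as follows; a simplified illustration with synthetic data and F1 as the target metric (the actual encodings, model set, and metric used in the paper may differ):</p>

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

# Synthetic stand-in for the encoded consumers of the first campaign.
rng = np.random.default_rng(0)
X1 = rng.normal(size=(400, 5))
y1 = (X1[:, 0] + rng.normal(scale=0.5, size=400) > 0).astype(int)

# 75% (stratified) to train each candidate model, 25% to tune its threshold.
X_tr, X_val, y_tr, y_val = train_test_split(
    X1, y1, train_size=0.75, stratify=y1, random_state=0)

grid = np.linspace(0, 1, 101)
best = None  # (name, threshold, metric value, model)
for name, model in [("dt", DecisionTreeClassifier(random_state=0)),
                    ("knn", KNeighborsClassifier())]:
    prob = model.fit(X_tr, y_tr).predict_proba(X_val)[:, 1]
    scores = [f1_score(y_val, (prob > h).astype(int), zero_division=0)
              for h in grid]
    i = int(np.argmax(scores))
    if best is None or scores[i] > best[2]:
        best = (name, grid[i], scores[i], model)

# Retrain the winning combination on all first-campaign consumers;
# keep the threshold it learned on the 25% validation split.
final_model = best[3].fit(X1, y1)
```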
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Inference phase: the second loyalty campaign</title>
        <p>With the trained model, we can start making predictions
during the second loyalty campaign. At point t in C<sub>2</sub> we
only have data available up to point t, so that is what we
use in the encoding of consumers. We can identify two sets
of consumers. The first are those who have redeemed and
have a positive label: P<sup>t</sup><sub>2</sub>. The second are those who have
not (yet) redeemed and may or may not still do so: U<sup>t</sup><sub>2</sub>.
Both are indexed by t; as explained in Section 3, the sets
change over time. Note that P<sup>t</sup><sub>2</sub> ⊆ P<sub>2</sub>, N<sub>2</sub> ⊆ U<sup>t</sup><sub>2</sub>,
and P<sup>t</sup><sub>2</sub> ∪ U<sup>t</sup><sub>2</sub> = C<sub>2</sub>. Since we cannot distinguish between
the latter two, we make predictions for them using f<sup>t</sup><sub>1</sub>, and
we assign the predicted labels using the learned h<sup>t</sup><sub>1,1</sub>. With
the predicted and actual labels (which are available when C<sub>2</sub>
ends), we can then compute the classification metric. This is
visually presented in Figure 4 in the white area. As discussed
previously, we are not able to update the prediction model
f<sup>t</sup><sub>1</sub> itself. However, we can change the threshold we use to
assign the predicted class from the predicted probabilities,
which is the topic of Section 4.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. OPUS framework</title>
      <p>In this section, we sketch the idea behind OPUS and then
formalize it. Table 2 summarizes the new symbols. An
overview of how OPUS adds to the existing prediction
problem is shown in Figure 4.</p>
      <p>We start from P<sup>t</sup><sub>2</sub> and U<sup>t</sup><sub>2</sub>. As we do not know the
labels of the latter, we cannot use these consumers to update
f<sup>t</sup><sub>1</sub>. We can however compute two characteristics about
the current loyalty campaign C<sub>2</sub>: the average probability
assigned by f<sup>t</sup><sub>1</sub> to each of these two sets. We refer to these as
the average converted probability p̄<sup>t</sup><sub>1,2</sub> and the average
non-converted probability ū<sup>t</sup><sub>1,2</sub>. The subscript 1 indicates
that f<sup>t</sup><sub>1</sub> is used for the model, the subscript 2 that P<sup>t</sup><sub>2</sub> and
U<sup>t</sup><sub>2</sub> are used for the consumers. We can do the same for
C<sub>1</sub>, using P<sup>t</sup><sub>1</sub> and U<sup>t</sup><sub>1</sub> with f<sup>t</sup><sub>1</sub> to compute p̄<sup>t</sup><sub>1,1</sub> and
ū<sup>t</sup><sub>1,1</sub> respectively. In general, for loyalty campaigns C<sub>i</sub>, C<sub>j</sub>
and time t, p̄<sup>t</sup><sub>i,j</sub> is the average of f<sup>t</sup><sub>i</sub>(c) over c ∈ P<sup>t</sup><sub>j</sub>, and
ū<sup>t</sup><sub>i,j</sub> is the average of f<sup>t</sup><sub>i</sub>(c) over c ∈ U<sup>t</sup><sub>j</sub>. Note that these values do not
necessarily say something about the model quality. On the
one hand, we expect p̄<sup>t</sup><sub>1,2</sub> to be high, since we know that
it evaluates only positive consumers. On the other hand,
we cannot make an expectation for ū<sup>t</sup><sub>1,2</sub>, as it evaluates
negative and possibly also positive consumers. One thing
to note is that for p̄<sup>t</sup><sub>1,2</sub> we only evaluate consumers that
have already converted by time t in C<sub>2</sub>. It might be that for
t early in the loyalty campaign we are including only a very
specific subset of eager consumers. Such consumers may
show different behaviour from consumers that converted
later in the loyalty campaign. As such, p̄<sup>t</sup><sub>1,2</sub> may not be an
accurate representation of the average predicted probability
of all positive consumers, p̄<sub>1,2</sub>. This is not a problem
per se. First, the model is still trained on all consumers; it
therefore also captures behaviour from the consumers that
convert later. Second, we can compare p̄<sup>t</sup><sub>1,2</sub> to p̄<sup>t</sup><sub>1,1</sub>:
both will be biased towards early redeemers if t is
small, so we expect to compare the same type of
consumers. We can therefore use these values to compare
C<sub>1</sub> and C<sub>2</sub> and adapt the prediction threshold.</p>
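      <p>A minimal sketch of computing these averages (the function names and the dictionary-backed classifier are illustrative assumptions, not the paper's implementation):</p>

```python
import numpy as np

def average_probabilities(predict, converted, not_converted):
    """p-bar / u-bar for one campaign at time t: the mean predicted
    probability over the converted set P and the not-yet-converted set U."""
    p_bar = float(np.mean([predict(c) for c in converted]))
    u_bar = float(np.mean([predict(c) for c in not_converted]))
    return p_bar, u_bar

# Hypothetical classifier scores for a handful of consumers.
scores = {"a": 0.9, "b": 0.7, "c": 0.4, "d": 0.2}
p_bar, u_bar = average_probabilities(scores.get, ["a", "b"], ["c", "d"])
```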
      <p>As we cannot change f<sup>t</sup><sub>1</sub>, the best we can do is to change
the prediction threshold. The best value is the one which
is computed based on the labelled sets of consumers of the
second loyalty campaign, P<sub>2</sub> and N<sub>2</sub>. In line with the
previous indexing, this value is h<sup>t</sup><sub>1,2</sub>. We can only know
this value once C<sub>2</sub> ends, so during the loyalty campaign we
need different ways of estimating it. Apart from the Baseline
methods, we differentiate between Heuristic thresholds and
Learned thresholds. All methods are presented in Table 3.</p>
      <sec id="sec-4-1">
        <title>4.1. Baseline thresholds</title>
        <p>For the baselines we consider values that at most use the
training data, so we only require data from the first loyalty
campaign. Without any alteration, a baseline of 0.5 is often
used as standard. We further add the optimal threshold
based on the training data, h<sup>t</sup><sub>1,1</sub>.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Heuristic thresholds</title>
        <p>For the heuristic thresholds we consider values that also
make use of the second loyalty campaign, and which can
be used without further knowledge. For this, we consider
four linear formulas. The first, ∅<sub>P</sub>, multiplies the optimal
threshold from training, h<sup>t</sup><sub>1,1</sub>, with the ratio between the
average converted probabilities, p̄<sup>t</sup><sub>1,2</sub>/p̄<sup>t</sup><sub>1,1</sub>. The idea
behind this method is that if the average probability of the
converted consumers has increased (or decreased), then the
model likely overestimates (underestimates) the probability
of a consumer being positive. Similarly, we can also use
the average non-converted probabilities for this, through
multiplying h<sup>t</sup><sub>1,1</sub> by ū<sup>t</sup><sub>1,2</sub>/ū<sup>t</sup><sub>1,1</sub>. We refer to this as
the ∅<sub>U</sub> method. Apart from the ratio, we can also take
the difference, adding p̄<sup>t</sup><sub>1,2</sub> − p̄<sup>t</sup><sub>1,1</sub> to h<sup>t</sup><sub>1,1</sub>, which we
refer to as ∆<sub>P</sub>. The ∆<sub>U</sub> method is defined in a similar
way. Note that in principle any of these four may result in
a value above 1, and the difference-based heuristic methods
may result in a value below 0. This does not invalidate the
threshold, but it means that all entities will be predicted to
be negative or positive, respectively.</p>
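        <p>The four heuristics reduce to one-line formulas; a hedged sketch under our own naming (ratio_p, ratio_u, diff_p, diff_u) with made-up campaign statistics:</p>

```python
def heuristic_thresholds(h_11, p_11, p_12, u_11, u_12):
    """Four heuristic estimates of the second-campaign threshold:
    ratio and difference variants, using the average converted (p) and
    non-converted (u) probabilities of the two campaigns. Values may
    leave [0, 1], which simply forces an all-negative (above 1) or
    all-positive (below 0) prediction."""
    return {
        "ratio_p": h_11 * p_12 / p_11,
        "ratio_u": h_11 * u_12 / u_11,
        "diff_p":  h_11 + (p_12 - p_11),
        "diff_u":  h_11 + (u_12 - u_11),
    }

# Illustrative numbers: converted probabilities dropped, so the ratio
# and difference variants both lower the training threshold of 0.6.
ths = heuristic_thresholds(h_11=0.6, p_11=0.80, p_12=0.72,
                           u_11=0.30, u_12=0.36)
```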
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Learned thresholds</title>
        <p>For the learned thresholds we need full information about
the second loyalty campaign, meaning we can only apply
the learned thresholds to a third loyalty campaign. What the
heuristic methods have in common is that they use one or
more of the p̄, ū and h values to make an estimation
of the target value (h<sup>t</sup><sub>1,2</sub>). For the final set of threshold
estimation methods, we train a regression model, which
we refer to as the threshold prediction model, or TPM. If
needed, we refer to f<sup>t</sup><sub>1</sub> as the consumer prediction model
to distinguish between the two. The TPM takes as descriptive
space all known values at t: the (non-)converted
probabilities p̄<sup>t</sup><sub>1,1</sub>, p̄<sup>t</sup><sub>1,2</sub>, ū<sup>t</sup><sub>1,1</sub>, ū<sup>t</sup><sub>1,2</sub>, the optimal training
threshold h<sup>t</sup><sub>1,1</sub>, the training metric value m<sup>t</sup><sub>1,1</sub>, the
model specification (type and encoding) and the point in
time t. This is represented by the red lines in Figure 4. As
target space, we have the value of h<sup>t</sup><sub>1,2</sub>. We can sample
these values for different values of t and for each of the
different model specifications. Using two loyalty campaigns,
we create a training set for the TPM, with the target values
coming from the second loyalty campaign and the
descriptive features from both loyalty campaigns. We denote this
TPM as TPM<sub>1,2</sub>, without the superscript t as multiple
values of t are used to train TPM<sub>1,2</sub>. Note that we cannot
use TPM<sub>1,2</sub> to estimate a threshold to use during C<sub>2</sub>, as C<sub>2</sub>
needs to be finished before TPM<sub>1,2</sub> can be learned. OPUS
does not depend on the specific regressor used as a base
model for the TPM; in Section 5 we evaluate Decision Tree
and Random Forest regressors.</p>
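        <p>A minimal sketch of fitting such a threshold prediction model; the feature layout and the synthetic training rows below are illustrative assumptions, not the paper's exact design:</p>

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical training rows for the threshold prediction model (TPM):
# one row per (model specification, time t) combination.
rng = np.random.default_rng(1)
X = rng.uniform(size=(60, 6))   # e.g. p11, p12, u11, u12, h11, t
# Synthetic target: the threshold that turned out optimal for the
# second campaign (clipped to [0, 1] like a real threshold).
y = np.clip(X[:, 4] + 0.1 * (X[:, 1] - X[:, 0])
            + rng.normal(scale=0.02, size=60), 0.0, 1.0)

tpm = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
h_est = float(tpm.predict(X[:1])[0])  # estimate for one known configuration
```

        <p>A decision-tree regressor can be swapped in the same way; since the forest averages training targets in [0, 1], its estimate is itself a valid threshold.</p>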
        <p>The idea behind the learned thresholds is that TPM<sub>1,2</sub>
learns how to adapt the prediction method (model and
threshold) from one loyalty campaign to another. This
means we could learn how the prediction method is best
‘transformed’ from C<sub>1</sub> to C<sub>2</sub>, and then apply this learned
transformation. Suppose that C<sub>1</sub> rewards glassware, C<sub>2</sub>
rewards pans, and a third C<sub>3</sub> rewards kitchen knives. If we
learn a transformation from glassware to pans, then we
can assess 1) how well it works on transforming the
prediction method for a different set of consumers from
glassware to pans, and 2) how well that transformation
applies to the transformation from pans to kitchen knives. In
the evaluation in Section 5 we limit the evaluation to the
first, due to the unavailability of additional data on a third
loyalty campaign.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Existing Threshold Adaption techniques</title>
        <p>
          OPUS is not the first framework that optimizes thresholds.
In GHOST [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], the optimal threshold of any classifier is
optimized on a validation set. The metric value for several
thresholds is computed and averaged over several stratified
subsets of a validation set. While this technique may prove
effective for finding an optimal threshold for the original
training and validation data, there is no update in a later
stage, and such an update would require labelled data. As
such, GHOST is not suitable for the problem at hand.
Similarly to GHOST, [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] proposes a method to find the optimal
threshold in a training set. The authors prove that for such
a method only the probabilities of the positive class are required
when optimizing for F1. While this technique potentially
saves computation time, it is not universally applicable to
all metrics. In this paper we optimize the thresholds by
evaluating all predicted probabilities, and selecting the one
that optimizes the target metric. Finally, [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] proposes a
technique to adapt the separating hyperplane position of
SVMs to improve its predictive quality. OPUS is diferent
from these because it tries to optimize the threshold using
the new, partly labelled data in an online fashion. OPUS is
designed to improve the predictions under concept drift.
        </p>
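        <p>The exhaustive search used in this paper (evaluating every predicted probability as a candidate threshold and keeping the best one) can be sketched as follows. This is an illustrative re-implementation, not the authors' code; the function name and toy data are ours.</p>
        <preformat>
```python
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(y_true, y_prob, metric=f1_score):
    # Try every predicted probability as a candidate threshold and
    # keep the one that maximizes the target metric.
    best_t, best_score = 0.5, -np.inf
    for t in np.unique(y_prob):
        score = metric(y_true, (y_prob >= t).astype(int))
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Toy example: five consumers with predicted redemption probabilities.
y_true = np.array([0, 0, 0, 1, 1])
y_prob = np.array([0.1, 0.2, 0.6, 0.7, 0.9])
t, s = best_threshold(y_true, y_prob)  # t = 0.7 separates the classes
```
        </preformat>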
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Evaluation</title>
      <p>For the experimental evaluation1 we use data from a
real-world retailer containing a total of about 160 000 consumers
shopping in exactly one of two consecutive loyalty
campaigns, 1 and 2. We partition the consumers into ten
separate datasets, 1 through 10. For each dataset we build
several prediction models based on a grid-search over
hyper-parameters and at different timesteps relative to the start of
the loyalty campaign.</p>
      <p>
        Models are trained and tested at multiple points in time
relative to the loyalty campaign, after 2, 4, 6, . . . , 18 weeks
into the campaign, ending before . The encoding of
consumers is based on the data available from the pre loyalty
campaign period and the data during the loyalty campaign
up to . We consider three hyper-parameters: the base
model, the encoding technique, and a strategy. The base
models are the consumer prediction models, for which we
use -nearest neighbours (NN), decision tree classifiers
(DTs), support vector classifiers (SVCs) and Adaptive
Hoefding Option Trees (AHOTs), all with their default parameters
in Scikit-Learn for NN, DTs, SVCs, and Scikit-Multiflow
[
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] for AHOTs, except for setting a constant seeding and
training the SVCs on probabilities instead of labels. The
encoding techniques define what type of features we use about
a consumer. The features are based on individual visits to
the store, using aggregates based on [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] and descriptive
labels based on [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. We finally adopt two strategies: with
and without adding the converted consumers in the test
set to the training set. In either strategy, the prediction
quality metrics are only computed on the non-converted
consumers in the test set. For each timestep we select the
hyper-parameter combination with the best performance
during the training phase using ℎ1 ,1 to report the results
for all threshold estimation methods.
(1: For the implementation, see github.com/YorickSpenrath/opus.)
      </p>
      <p>For each combination of hyper-parameters we compute
1 ,, 1 ,, and ℎ1 , for  ∈ {1, 2} and each timestep
. On the one hand, we can use this to compute all threshold
estimation methods discussed in Table 3. On the other hand,
we can use all combinations to train threshold prediction
models. For each dataset we compute two threshold prediction models: one
using a Decision Tree regressor and one using a
Random Forest regressor. These models are then used in
every other dataset to estimate optimal
thresholds. We also add two results for comparison. The first uses
the actual optimal threshold (ℎ1 ,2) and the second uses
a prediction model learned on 75% of the data of loyalty
campaign 2 and tested on the remaining 25% (Retrain).
Both these methods violate the train-then-test principle but
are added as a comparison of how close we can get.</p>
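      <p>A minimal sketch of this setup, where a threshold prediction model trained on one dataset estimates optimal thresholds for another. The feature layout and array names below are hypothetical stand-ins for the campaign statistics described above.</p>
      <preformat>
```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Hypothetical training data: each row summarizes one combination of
# hyper-parameters and timestep in one dataset; the target is the
# optimal threshold the threshold prediction model should output.
rng = np.random.default_rng(0)
X_train = rng.random((60, 3))   # stand-in features
y_train = rng.random(60)        # stand-in optimal thresholds in [0, 1]

tpm_dt = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
tpm_rf = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# The models trained on one dataset estimate thresholds for another.
X_other = rng.random((5, 3))
est_dt = tpm_dt.predict(X_other)
est_rf = tpm_rf.predict(X_other)
```
      </preformat>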
      <p>During the training phase, models are trained on a
stratified 75% split of the consumers that visited the store during
loyalty campaign 1. The remaining 25% are used as the test
set. During the inference phase, models are trained on all
consumers that visited the store during loyalty campaign
2. The consumers that visited the store by time  are used
as test set. The entire test set is used to compute ℎ1 ,
using the model trained on the training set. The converted
(non-converted) consumers by time  in the test set are used
to compute 1 , (1 ,). The non-converted consumers
by time  are used to compute the model performance for a
given threshold estimation method.</p>
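      <p>The stratified 75%/25% split can be reproduced with scikit-learn's train_test_split; the toy consumer features and labels below are ours, chosen only to illustrate the mechanism.</p>
      <preformat>
```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the campaign-1 consumers: 20 consumers, 5 redeemers.
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 15 + [1] * 5)

# stratify=y keeps the redeemer ratio (approximately) equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)
```
      </preformat>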
      <p>
        Next to these methods, we also apply PU-learning defined
by [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. For this we use the same hyper-parameters as for
the other experiments. We first find the best combination
based on loyalty campaign 1 and then report the score for
that hyper-parameter combination in campaign 2, computed over all
non-converted consumers at the respective point in time.
      </p>
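      <p>The PU-learning approach of [5] can be sketched as follows: train a classifier g that separates labelled from unlabelled examples, estimate the label frequency c as the average of g on the known positives, and rescale g by c. The base classifier and names are our choices, and for brevity c is estimated on the training positives rather than a held-out validation set as in [5].</p>
      <preformat>
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pu_probabilities(X, s, X_new):
    # Step 1: classifier for P(s = 1 | x), labelled vs. unlabelled.
    g = LogisticRegression().fit(X, s)
    # Step 2: label frequency c = E[g(x) | s = 1] on known positives.
    c = g.predict_proba(X[s == 1])[:, 1].mean()
    if c == 0:  # the base model recognizes no positives: method fails
        raise ValueError("cannot calibrate: c = 0")
    # Step 3: rescale to approximate P(y = 1 | x).
    return np.clip(g.predict_proba(X_new)[:, 1] / c, 0.0, 1.0)

# Toy data: 30 known redeemers near +2, 60 unlabelled consumers near -2.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(2, 1, (30, 2)), rng.normal(-2, 1, (60, 2))])
s = np.array([1] * 30 + [0] * 60)
probs = pu_probabilities(X, s, X)
```
      </preformat>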
      <p>We apply the above for both the F1 and Accuracy metrics.
We present two types of results: those for timestep t = 4,
separated per dataset, and those for all timesteps, aggregated
over the datasets. The reason for separately reporting t = 4
comes from domain knowledge: this is the point in the
loyalty campaign where additional rewards for the loyalty
campaign can still be ordered by the supermarket.</p>
      <p>5.1. Results for t = 4</p>
      <p>Results are presented per split (one per row) and threshold
estimation method (one per column) in Figure 5. We do
not report the values of the combinations where a threshold prediction model is
learned from the same split as the data it is tested on (the
white diagonals).</p>
      <p>Baseline methods For the baseline methods we see that
using the regular threshold is not much better than always
predicting True or False, with almost all scores being 0.50, which is
effectively the worst value one can get for balanced
accuracy in a binary classification. Using the optimal training
threshold works much better, with the exception of
splits 5, 6 and 7.</p>
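      <p>To illustrate why 0.50 is the floor here: balanced accuracy averages the per-class recalls, so a constant predictor always scores 0.5 regardless of class imbalance. The toy labels below are ours.</p>
      <preformat>
```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score

y_true = np.array([0] * 90 + [1] * 10)   # heavily imbalanced labels
always_true = np.ones_like(y_true)       # predict every shopper redeems

# Recall is 1.0 for the positive class and 0.0 for the negative class,
# so the balanced accuracy is (1.0 + 0.0) / 2 = 0.5.
score = balanced_accuracy_score(y_true, always_true)
```
      </preformat>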
      <p>Heuristic methods For the heuristic methods we have
that, except for (8, ∆ ), the  methods consistently
perform as well as or better than the baseline ℎ,. The 
values do much worse however, especially for the ∆ .
As explained in Section 4.2, if the difference between ,
and , is too high, this method might end up predicting
all shoppers as redeemers or all shoppers as non-redeemers.
This is what happens for splits 0, 2 and 8.</p>
      <p>Learned methods With only a few exceptions, the
learned methods perform at least as well as the
baseline and often beat it. Furthermore, for most splits the
best values even come close to the results for the optimal threshold, the
value which we aim to estimate. Next to this, we see that
the Random Forest threshold prediction model almost always performs as well
as or better than the Decision Tree based one, compared
on the same combination of splits. This is to be expected
given the complexity of the Random Forest model, which
allows it to capture more difficult relations.</p>
      <p>PU-Learning One weakness of PU-learning arises when the
number of known positive labels is small. This results in a base
prediction model that cannot predict positive points,
resulting in an average probability of 0 for the known positive
labels. In such a scenario, PU-learning fails to produce any
result. This is what happens in 4 of the splits. Note that
OPUS does not have this problem, as it uses the
labelled data from the last loyalty campaign. In the splits where
there are enough known positive labels, PU-learning is still
outperformed by OPUS, for a similar reason.</p>
      <sec id="sec-5-1">
        <title>5.2. Results aggregated per timestep</title>
        <p>We next aggregate the results over the splits for each
timestep in Figures 6a and 6b. We report the mean and
95% confidence interval of the metric for every timestep
(each row). For most estimation methods (each column) this
is the average over the 10 values from the different splits; for the
learned estimation methods we consider all combinations.
In other words, the reported aggregated values for the Decision Tree
and Random Forest models are taken over all 90 combinations.</p>
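        <p>The aggregation can be computed, for instance, with a t-based confidence interval. This is an illustrative choice; the paper does not specify the interval construction, and the function name and values below are ours.</p>
        <preformat>
```python
import numpy as np
from scipy import stats

def mean_ci(values, confidence=0.95):
    # Mean and half-width of a t-based confidence interval.
    values = np.asarray(values, dtype=float)
    half = stats.t.ppf((1 + confidence) / 2,
                       len(values) - 1) * stats.sem(values)
    return values.mean(), half

# e.g. ten metric values, one per split, for a single timestep
m, h = mean_ci([0.81, 0.82, 0.80, 0.85, 0.83,
                0.79, 0.84, 0.82, 0.81, 0.83])
```
        </preformat>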
        <p>Accuracy Some results from Section 5.1 transfer
to the other timesteps. The learned thresholds seem to
perform better than using the optimal training threshold. Note that, as we have
averaged over both the splits used for the evaluation as
well as the splits used for computing the threshold prediction model, this result
describes the benefit of using the learned method in OPUS
in general: it shows that, on average, using
any other dataset to train the model that estimates the
optimal threshold is beneficial. What is more, mainly for the
Random Forest model, we see that we get values that are
close to using the actual optimal threshold. For the
heuristic thresholds, we see that the ratio-based ones (∅)
perform close to or better than the baseline.
For PU-learning we see that the results improve over time
as more positive labels are known. This is to be expected,
though it does not yet catch up with the OPUS methods.</p>
        <p>F1 We see something similar to the accuracy results. This
is expected, as we are penalizing in a similar way by
comparing the actual with the predicted labels for each shopper.
There are two important differences. The first is that we
see lower values. This is because F1 does not account for
class imbalance as we did for the balanced accuracy score.
Given that the redeemer class is much smaller than the
non-redeemer class, the reported values for the optimal threshold and Retrain
are reasonable. This is also seen in the other baselines,
which have a much lower score. The second is that we are
further off from the optimal thresholds. This is likely
because of a similar reason: F1 is much more penalizing
for imbalanced datasets, and as such, being able to use the
optimal thresholds/all labels allows for a much better score
than estimating them using any of the methods from OPUS.
Like accuracy, the results of PU-learning improve over time,
eventually even improving over the OPUS methods. The lower
accuracy and higher F1 for PU-learning are likely caused by
the unbalanced dataset as well.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this paper we have presented OPUS, a framework to
estimate a better prediction threshold when only one class
of the target labels is available. The advantage of such
optimization is that we can adapt existing prediction
models without having to rely on fully labelled data. This is a
more realistic assumption when the true labels of one class
are deferred. OPUS makes use of two types of threshold
estimation methods, where learned methods are more effective
in getting to an optimal prediction threshold at the cost of
requiring more data (only being available at a third
campaign), compared to heuristic methods that do not have
these benefits or limitations. While the initial results are
promising, we want to elaborate the experimental analysis,
both by including more than two loyalty campaigns and
by applying OPUS to a different use case in the student
learning domain. In the design of OPUS only characteristics
that belong to the current timestep are considered, not those of
all previous timesteps. We argue that the progression of
these characteristics, as well as the model performance over
time, and how they compare to the same period in the
previous loyalty campaign, can be interesting directions to improve the
framework in future work.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>W.</given-names>
            <surname>Rizzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Di Francescomarino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ghidini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Maggi</surname>
          </string-name>
          ,
          <article-title>How do I update my model? On the resilience of Predictive Process Monitoring models to change</article-title>
          , KAIS
          <volume>64</volume>
          (
          <year>2022</year>
          )
          <fpage>1385</fpage>
          -
          <lpage>1416</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Fahy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gongora</surname>
          </string-name>
          ,
          <article-title>Classification in Dynamic Data Streams with a Scarcity of Labels</article-title>
          , IEEE TKDE (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>V. M. A.</given-names>
            <surname>Souza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. F.</given-names>
            <surname>Silva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. E. A. P. A.</given-names>
            <surname>Batista</surname>
          </string-name>
          ,
          <article-title>Data Stream Classification Guided by Clustering on Nonstationary Environments and Extreme Verification Latency</article-title>
          , in: SDM,
          <year>2015</year>
          , pp.
          <fpage>873</fpage>
          -
          <lpage>881</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N. J.</given-names>
            <surname>Bombaij</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Dekimpe</surname>
          </string-name>
          ,
          <article-title>Designing successful temporary loyalty programs: An exploratory study on retailer and country differences</article-title>
          ,
          <source>International Journal of Research in Marketing 39</source>
          (
          <year>2022</year>
          )
          <fpage>1275</fpage>
          -
          <lpage>1295</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.</given-names>
            <surname>Elkan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Noto</surname>
          </string-name>
          ,
          <article-title>Learning classifiers from only positive and unlabeled data</article-title>
          ,
          <source>KDD</source>
          (
          <year>2008</year>
          )
          <fpage>213</fpage>
          -
          <lpage>220</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Peeperkorn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. O.</given-names>
            <surname>Vázquez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Stevens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Smedt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Broucke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Weerdt</surname>
          </string-name>
          ,
          <article-title>Outcome-Oriented Predictive Process Monitoring on Positive and Unlabelled Event Logs</article-title>
          , ICPM workshops (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bekker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Robberechts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Davis</surname>
          </string-name>
          ,
          <article-title>Beyond the Selected Completely at Random Assumption for Learning from Positive and Unlabeled Data</article-title>
          ,
          <source>LNCS 11907 LNAI</source>
          (
          <year>2020</year>
          )
          <fpage>71</fpage>
          -
          <lpage>85</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Claesen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>De Smet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Suykens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>De Moor</surname>
          </string-name>
          ,
          <article-title>A robust ensemble approach to learn from positive and unlabeled data using SVM base models</article-title>
          ,
          <source>Neurocomputing</source>
          <volume>160</volume>
          (
          <year>2015</year>
          )
          <fpage>73</fpage>
          -
          <lpage>84</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bekker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Davis</surname>
          </string-name>
          ,
          <article-title>Learning from positive and unlabeled data: a survey</article-title>
          ,
          <source>Machine Learning</source>
          <volume>109</volume>
          (
          <year>2020</year>
          )
          <fpage>719</fpage>
          -
          <lpage>760</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P.</given-names>
            <surname>de Abreu Lopes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>de Arruda Camargo</surname>
          </string-name>
          ,
          <article-title>FuzzStream: Fuzzy data stream clustering based on the online-offline framework</article-title>
          ,
          <source>IEEE International Conference on Fuzzy Systems</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T. P.</given-names>
            <surname>da Silva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. M. A.</given-names>
            <surname>Souza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. E. A. P. A.</given-names>
            <surname>Batista</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>de Arruda Camargo</surname>
          </string-name>
          ,
          <article-title>A fuzzy classifier for data streams with infinitely delayed labels</article-title>
          ,
          <source>LNCS</source>
          <volume>11401</volume>
          (
          <year>2019</year>
          )
          <fpage>287</fpage>
          -
          <lpage>295</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>K. B.</given-names>
            <surname>Dyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Capo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Polikar</surname>
          </string-name>
          ,
          <article-title>COMPOSE: A semisupervised learning framework for initially labeled nonstationary streaming data</article-title>
          ,
          <source>IEEE Transactions on Neural Networks and Learning Systems</source>
          <volume>25</volume>
          (
          <year>2014</year>
          )
          <fpage>12</fpage>
          -
          <lpage>26</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>B.</given-names>
            <surname>Settles</surname>
          </string-name>
          ,
          <article-title>Active Learning Literature Survey</article-title>
          ,
          <source>Technical Report</source>
          , University of Wisconsin-Madison,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>C.</given-names>
            <surname>Esposito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. A.</given-names>
            <surname>Landrum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Schneider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Stiefl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riniker</surname>
          </string-name>
          ,
          <article-title>GHOST: Adjusting the Decision Threshold to Handle Imbalanced Data in Machine Learning</article-title>
          ,
          <source>Journal of Chemical Information and Modeling</source>
          <volume>61</volume>
          (
          <year>2021</year>
          )
          <fpage>2623</fpage>
          -
          <lpage>2640</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ju</surname>
          </string-name>
          ,
          <article-title>Finding the Best Classification Threshold in Imbalanced Classification</article-title>
          ,
          <source>Big Data Research</source>
          <volume>5</volume>
          (
          <year>2016</year>
          )
          <fpage>2</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>H.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Mu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zuo</surname>
          </string-name>
          ,
          <article-title>Support vector machine-based optimized decision threshold adjustment strategy for classifying imbalanced data</article-title>
          ,
          <source>Knowledge-Based Systems</source>
          <volume>76</volume>
          (
          <year>2015</year>
          )
          <fpage>67</fpage>
          -
          <lpage>78</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Montiel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Read</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bifet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Abdessalem</surname>
          </string-name>
          ,
          <article-title>Scikit-Multiflow: A Multi-output Streaming Framework</article-title>
          , JMLR
          <volume>19</volume>
          (
          <year>2018</year>
          )
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Spenrath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hassani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. F.</given-names>
            <surname>van Dongen</surname>
          </string-name>
          ,
          <article-title>Online prediction of aggregated retailer consumer behaviour</article-title>
          ,
          <source>ICPM Workshops</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Spenrath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hassani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. F.</given-names>
            <surname>van Dongen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Tariq</surname>
          </string-name>
          ,
          <article-title>Learning an efficient distance metric for retailer transaction data</article-title>
          ,
          <source>in: PKDD</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>323</fpage>
          -
          <lpage>338</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>