<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Large-scale Dataset for Decision Making Algorithms</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yuta Saito</string-name>
          <email>saito@hanjuku-kaso.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shunsuke Aihara</string-name>
          <email>shunsuke.aihara@zozo.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Megumi Matsutani</string-name>
          <email>megumi.matsutani@zozo.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yusuke Narita</string-name>
          <email>yusuke.narita@yale.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Causality in Search and Recommendation (CSR) and Simulation of Information Retrieval Evaluation (Sim4IR) workshops at SIGIR, 2021 Editors of the proceedings (editors): Krisztian Balog</institution>
          ,
          <addr-line>Xianjie Chen, Xu Chen, David Maxwell, Paul Thomas, Shuo Zhang, Yi Zhang, and Yongfeng Zhang. "</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Hanjuku-kaso</institution>
          ,
          <addr-line>Co., Ltd</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Yale University</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>ZOZO Technologies, Inc.</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>We build and publicize the Open Bandit Dataset and Pipeline to facilitate scalable and reproducible research on bandit algorithms. They are especially suitable for off-policy evaluation (OPE), which attempts to predict the performance of hypothetical algorithms using data generated by a different algorithm. We construct the dataset based on experiments and implementations on a large-scale fashion e-commerce platform, ZOZOTOWN. The data contain the ground-truth about the performance of several bandit policies and enable fair comparisons of different OPE estimators.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Interactive bandit and reinforcement learning systems produce log data valuable for evaluating and redesigning the systems. For example, the logs of a news recommendation system record which news article was presented and whether the user read it, giving the system designer a chance to make its recommendation more relevant. Exploiting log data is, however, more difficult than conventional supervised machine learning: the result is only observed for the action chosen by the system but not for all the other actions the system could have taken. The logs are also biased in that they over-represent the actions favored by the system.</p>
      <p>A potential solution to this problem is an A/B test that compares the performance of counterfactual systems in an online environment. However, A/B testing counterfactual systems is often difficult, since deploying a new policy is time- and money-consuming, and entails a risk of failure.</p>
      <p>This leads us to the problem of off-policy evaluation (OPE), which aims to estimate the performance of a counterfactual policy using only log data collected by a past (or behavior) policy. Such an evaluation allows us to compare the performance of candidate counterfactual policies to decide which policy should be deployed. This alternative approach thus solves the above problem with the A/B test approach. Applications range from contextual bandits [<xref ref-type="bibr" rid="ref1 ref2">1, 2, 3, 4, 5, 6, 7, 8, 9, 10</xref>] and reinforcement learning in the web industry [11, 12, 13, 14, 15, 16, 17] to other social domains such as healthcare [18] and education [19].</p>
      <p>While the research community has produced theoretical breakthroughs, the experimental evaluation of OPE remains primitive. Specifically, it lacks a public benchmark dataset for comparing the performance of different methods. Researchers often validate their methods using synthetic simulation environments [12, 20, 17]. A version of the synthetic approach is to modify multi-class classification datasets and treat supervised machine learning methods as bandit policies to evaluate off-policy estimators [21, 22, 23, 10]. An obvious problem with these studies is that there is no guarantee that their simulation environment is similar to real-world settings. To solve this issue, [24, 25, 5, 13] use proprietary real-world datasets. Since these datasets are not public, however, it remains challenging to reproduce the results and compare their methods with new ideas in a fair manner. This is in contrast to other domains of machine learning, where large-scale open datasets, such as the ImageNet dataset [26], have been pivotal in driving objective progress [27, 28, 29, 30, 31].</p>
      <p>Our goal is to implement and evaluate OPE of bandit algorithms in realistic and reproducible ways. We release the Open Bandit Dataset, a logged bandit feedback dataset collected on a large-scale fashion e-commerce platform, ZOZOTOWN.<sup>1</sup> ZOZOTOWN is the largest fashion e-commerce platform in Japan with over 3 billion USD annual Gross Merchandise Value. When the platform produced the data, it used Bernoulli Thompson Sampling (Bernoulli TS) and Random policies to recommend fashion items to users. The dataset includes an A/B test of these policies and collected over 26 million records of users&#8217; clicks and the ground-truth about the performance of Bernoulli TS and Random. To streamline and standardize the analysis of the Open Bandit Dataset, we also provide the Open Bandit Pipeline, a series of implementations of dataset preprocessing, behavior bandit policy simulators, and OPE estimators.</p>
      <p>1. https://corp.zozo.com/en/service/</p>
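      <p>To make the workflow concrete, the following is a minimal sketch of analyzing the dataset with the Open Bandit Pipeline. The class and argument names follow the publicly released obp package&#8217;s quickstart; treat the exact details as assumptions to verify against the documentation of the version you install.</p>
      <preformat>
# A hedged sketch of the Open Bandit Pipeline workflow, assuming the
# public obp package; names and defaults may differ across versions.
from obp.dataset import OpenBanditDataset
from obp.policy import BernoulliTS
from obp.ope import OffPolicyEvaluation, InverseProbabilityWeighting

# load the "all" campaign logs generated by the Random behavior policy
dataset = OpenBanditDataset(behavior_policy="random", campaign="all")
bandit_feedback = dataset.obtain_batch_bandit_feedback()

# counterfactual policy: Bernoulli TS replayed offline
evaluation_policy = BernoulliTS(
    n_actions=dataset.n_actions,
    len_list=dataset.len_list,
    is_zozotown_prior=True,
    campaign="all",
    random_state=12345,
)
action_dist = evaluation_policy.compute_batch_action_dist(
    n_rounds=bandit_feedback["n_rounds"],
)

# estimate the policy value of Bernoulli TS with the IPW estimator
ope = OffPolicyEvaluation(
    bandit_feedback=bandit_feedback,
    ope_estimators=[InverseProbabilityWeighting()],
)
print(ope.estimate_policy_values(action_dist=action_dist))
      </preformat>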
    </sec>
    <sec id="sec-2">
      <title>2. Setup</title>
      <sec id="sec-2-1">
        <title>We consider a general multi-armed contextual bandit</title>
        <p>setting. Let  = {0, ..., } be a finite set of  + 1
actions (equivalently, arms or treatments), that the decision
maker can choose from. Let  (· ) :  →
potential reward function that maps actions into rewards</p>
        <p>R denote a
or outcomes, where  () is the reward when action  is
chosen (e.g., whether a fashion item as an action results
in a click). Let  denote a context vector (e.g., the user’s
demographic profile and user-item interaction history)
counterfactual policy  which might be diferent from  :
  := E( (· ),)∼ [
∑︁  () (|)]

=0
= E( (· ),)∼ , ∼   [

=0
∑︁  ()
 (|)
 (|)
where the last equality uses the independence of  and
 (· ) conditional on  and the definition of  (·| ).</p>
        <p>We allow the counterfactual policy  to be degenerate,
i.e., it may choose a particular action with probability
1. Estimating   before implementing  in an online
environment is valuable because 
may perform poorly</p>
        <p>(1)
 . We think of ( (· ), ) as a random vector with un- value by comparing their estimated performances.
maps each context  ∈  into a distribution over ac- counterfactual policy. A widely-used method, DM [32],
historical logged bandit feedback with  rounds of obser- to predict the mean reward function. DM then uses it to
ior policy   as follows: (i) In each round  = 1, ...,  , is a good approximation to the mean reward function,
• In each round  = 1, ...,  , ((· ), ) is i.i.d. tion cannot be easily quantified from data [22].
 (, ) = E[ ()| = ].
define the mean reward function  :  ×  →
known distribution . Given a vector of ( (· ), ), we</p>
        <p>R as</p>
        <p>We call a function  :  → ∆( ) a policy, which
tions, where  (|) is the probability of taking action</p>
        <p>given a context vector . Let {(, , )}=1 be
vations.  := (0, ..., )′, where  is a binary
variable indicating whether action  is chosen in round .</p>
        <p>If  is chosen in round ,  = 1, otherwise  = 0.
 := ∑︀
=0</p>
        <p>() and  denote the reward and
the context observed in round , respectively. We assume
that a logged bandit feedback is generated by a
behav((· ), ) is i.i.d. drawn from distribution ., (ii) Given
, an action is randomly chosen based on  (·| ),
creating the action choice  and the associated reward
.</p>
        <p>drawn from distribution .</p>
        <p>associated reward .
• Given , an action is randomly chosen based on
 (·| ), creating the action choice  and the
∑︀</p>
        <p>=0
Suppose that   is fixed for all rounds, and thus  is
i.i.d. across rounds. Because ((· ), ) is i.i.d. across
rounds and  =</p>
        <p>(), each observation
(, , ) is i.i.d. across rounds. Note that  is
independent of (· ) conditional on .</p>
      </sec>
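      <p>The following is a minimal synthetic sketch of this data-generating process in Python. The softmax behavior policy, the logistic mean-reward function, and all constants are illustrative assumptions, not properties of the Open Bandit Dataset; later sketches reuse the variables defined here (log, m, d, mu, pi_b).</p>
      <preformat>
# Synthetic logged bandit feedback: i.i.d. contexts, a stochastic
# behavior policy pi_b, and binary rewards. Purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
T, m, d = 10_000, 4, 5               # rounds; actions {0, ..., m}; context dim
theta = rng.normal(size=(m + 1, d))  # hypothetical reward parameters

def mu(x):
    """Mean reward mu(a, x) for every action a, squashed into [0, 1]."""
    return 1.0 / (1.0 + np.exp(-(theta @ x)))

def pi_b(x):
    """Behavior policy pi_b(.|x): a softmax over actions (an assumption)."""
    logits = theta @ x
    p = np.exp(logits - logits.max())
    return p / p.sum()

log = []
for t in range(T):
    x_t = rng.normal(size=d)              # X_t ~ G (i.i.d.)
    p = pi_b(x_t)
    a_t = rng.choice(m + 1, p=p)          # A_t ~ pi_b(.|X_t)
    y_t = rng.binomial(1, mu(x_t)[a_t])   # observed reward Y_t = Y_t(A_t)
    log.append((x_t, a_t, y_t, p[a_t]))   # keep pi_b(A_t|X_t) for OPE
      </preformat>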
    </sec>
    <sec id="sec-3">
      <title>Of-Policy Evaluation</title>
      <sec id="sec-3-1">
        <title>3.1. Prediction Target</title>
        <sec id="sec-3-1-1">
          <title>We are interested in using the historical logged bandit</title>
          <p>data to estimate the following policy value of any given</p>
        </sec>
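        <p>As a sanity check on equation (1), the sketch below (continuing the synthetic example of Section 2) compares the two expressions by Monte Carlo. The uniform counterfactual policy pi_e is a hypothetical choice introduced here.</p>
        <preformat>
# Both sides of Eq. (1) should agree up to Monte Carlo error.
# Reuses m, d, mu, and pi_b from the Section 2 sketch.
import numpy as np

def pi_e(x):
    """A hypothetical counterfactual policy: uniform over the m + 1 actions."""
    return np.full(m + 1, 1.0 / (m + 1))

rng2 = np.random.default_rng(1)
direct, weighted = [], []
for _ in range(20_000):
    x = rng2.normal(size=d)
    direct.append(pi_e(x) @ mu(x))        # sum_a pi(a|x) E[Y(a)|x]
    p = pi_b(x)
    a = rng2.choice(m + 1, p=p)
    y = rng2.binomial(1, mu(x)[a])
    weighted.append(y * pi_e(x)[a] / p[a])  # importance-weighted reward

print(np.mean(direct), np.mean(weighted))   # approximately equal
        </preformat>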
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Benchmark Estimators</title>
        <p>There are several approaches to estimate the value of the
ifrst learns a supervised machine learning model, such as
random forest, ridge regression, and gradient boosting,
estimate the policy value as</p>
        <p>ˆ
  =</p>
        <p>1 ∑︁ ∑︁

=1 =0
 (|)ˆ(|).
where ˆ(|) is the estimated reward function. If ˆ(|)
this estimator accurately predicts the policy value of the
counterfactual policy   . If ˆ(|) fails to approximate
the mean reward function well, however, the final
estimator is no longer consistent. The model misspecification
issue is problematic because the extent of
misspecifica</p>
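        <p>A hedged sketch of DM on the synthetic log from Section 2 follows. The logistic-regression reward model is one illustrative choice among the supervised learners mentioned above; it reuses log, m, and pi_e from the earlier sketches.</p>
        <preformat>
# Direct Method: fit mu_hat on (context, action) pairs, then average
# pi_e(a|X_t) * mu_hat(a|X_t) over the log.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([x for x, a, y, pb in log])
A = np.array([a for x, a, y, pb in log])
Y = np.array([y for x, a, y, pb in log])

# one reward model over contexts concatenated with one-hot actions
XA = np.hstack([X, np.eye(m + 1)[A]])
reward_model = LogisticRegression(max_iter=1000).fit(XA, Y)

def mu_hat(x):
    """Estimated mean reward for every action at context x."""
    xa = np.hstack([np.tile(x, (m + 1, 1)), np.eye(m + 1)])
    return reward_model.predict_proba(xa)[:, 1]

# DM estimate: (1/T) sum_t sum_a pi(a|X_t) mu_hat(a|X_t)
V_dm = np.mean([pi_e(x) @ mu_hat(x) for x in X])
        </preformat>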
        <p>To alleviate the issue with DM, researchers often use another estimator called IPW [33, 6]. IPW re-weights the rewards by the ratio of the counterfactual policy and behavior policy as
\[
\hat{V}^{\pi}_{\mathrm{IPW}} = \frac{1}{T} \sum_{t=1}^{T} \sum_{a=0}^{m} A_{ta}\, Y_t\, \frac{\pi(a \mid X_t)}{\pi_b(a \mid X_t)}.
\]</p>
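        <p>On the synthetic log, IPW reduces to a one-line re-weighting, since $\pi_b(A_t \mid X_t)$ was stored at logging time. This sketch reuses log and pi_e from the sketches above.</p>
        <preformat>
# IPW: re-weight each observed reward by pi_e(A_t|X_t) / pi_b(A_t|X_t).
import numpy as np

V_ipw = np.mean([y * pi_e(x)[a] / pb for x, a, y, pb in log])
        </preformat>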
        <sec id="sec-3-2-1">
          <title>When the behavior policy is known, the IPW estimator is unbiased and consistent for the policy value. However, it can have a large variance, especially when the counterfactual policy significantly deviates from the behavior</title>
          <p>The final approach is DR [ 21], which combines the
policy.
above two estimators as</p>
          <p>ˆ
  =
 
=1 =0

1 ∑︁ ∑︁( − ˆ(|))
 (|)
 (|)
Notes: Bernoulli TS stands for Bernoulli Thompson Sampling. #Data is the total number of user impressions
observed during the 7-day experiment. #Items is the total number of items having a non-zero probability of
being recommended by each behavior policy. Average Age is the average age of users in each campaign. CTR is
the percentage of a click being observed in log data, and this is the ground-truth performance of behavior
policies in each campaign. 95% confidence interval (CI) of CTR is calculated based on a normal approximation of
Bernoulli sampling. Relative-CTR is CTR relative to that of the Random policy for the “All” campaign.
+  (|)ˆ(|).</p>
          <p>statistics of the dataset in Table 1. The data is large and
contains many millions of recommendation instances.</p>
          <p>The number of actions is also sizable, so this setting is
challenging for bandit algorithms and their OPE.</p>
        </sec>
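        <p>Finally, a sketch of DR on the same synthetic log: the importance-weighted term is applied to the residual $Y_t - \hat{\mu}(A_t \mid X_t)$, with the DM term acting as a control variate. It reuses log, pi_e, and mu_hat from the sketches above.</p>
        <preformat>
# Doubly Robust: IPW on the reward-model residual plus the DM term.
import numpy as np

V_dr = np.mean([
    (y - mu_hat(x)[a]) * pi_e(x)[a] / pb + pi_e(x) @ mu_hat(x)
    for x, a, y, pb in log
])
        </preformat>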
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusion and Future Work</title>
      <p>DR mimics IPW to use a weighted version of rewards,
but DR also uses the estimated mean reward function as
a control variate to decrease the variance. It preserves
the consistency of IPW if either the importance weight or
the mean reward estimator is accurate (a property called
double robustness). Moreover, DR is semiparametric
eficient [5] when the mean reward estimator is correctly
specified. On the other hand, when it is wrong, this
estimator can have larger asymptotic mean-squared-error
than IPW [34] and perform poorly in practice [35].</p>
      <sec id="sec-4-1">
        <title>To enable realistic and reproducible evaluation of of</title>
        <p>policy evaluation of bandit algorithms, we have
publicized the Open Bandit Dataset–a benchmark logged
bandit dataset collected on a large-scale fashion e-commerce
platform.</p>
        <p>In the near future, we plan to publicize the
perfor4. Dataset mance of the selected counterfactual policy in an online
environment. Such an evaluation will produce additional
We apply and evaluate the above methods by using real- log data generated by the contextual policy (while the
world data. Our data is logged bandit feedback data we current open dataset contains only log data generated by
call the Open Bandit Dataset.2 The dataset is provided by the old context-free policy). We aim to constantly expand
ZOZO, Inc.3, the largest Japanese fashion e-commerce and improve the Open Bandit Dataset to include more
company with a market capitalization of over 5 billion data and tasks.</p>
        <p>USD (as of May 2020). The company recently started
using context-free multi-armed bandit algorithms to
recommend fashion items to users in their large-scale fashion References
e-commerce platform called ZOZOTOWN.</p>
        <p>We collected the data in a 7-days experiment in late
November 2019 on three “campaigns,” corresponding to
“all”, “men’s”, and “women’s” items, respectively. Each
campaign randomly uses either the Random algorithm
or the Bernoulli Thompson Sampling (Bernoulli TS)
algorithm for each user impression. In the notation of our
bandit setups, action  is one of the possible fashion items,
while reward  is a click indicator. We describe some</p>
      </sec>
      <sec id="sec-4-2">
        <title>2https://research.zozo.com/data.html</title>
        <p>3https://corp.zozo.com/en/about/profile/
and Conference Proceedings, volume 26, 2012, pp. ings of the 32th International Conference on
Ma19–36. chine Learning, 2015, pp. 2380–2388.
[3] L. Li, W. Chu, J. Langford, R. E. Schapire, A [16] P. S. Thomas, G. Theocharous, M. Ghavamzadeh,
Contextual-bandit Approach to Personalized News High-Confidence Of-Policy Evaluation, AAAI
Article Recommendation, in: Proceedings of the (2015) 3000–3006.
19th International Conference on World Wide Web, [17] T. Xie, Y. Ma, Y.-X. Wang, Towards optimal
ofACM, 2010, pp. 661–670. policy evaluation for reinforcement learning with
[4] L. Li, W. Chu, J. Langford, X. Wang, Unbiased Of- marginalized importance sampling, in: Advances
lfine Evaluation of Contextual-bandit-based News in Neural Information Processing Systems, 2019, pp.
Article Recommendation Algorithms, in: Proceed- 9665–9675.
ings of the Fourth ACM International Conference [18] S. A. Murphy, M. J. van der Laan, J. M. Robins, C. P.
on Web Search and Data Mining, 2011, pp. 297–306. P. R. Group, Marginal mean models for dynamic
[5] Y. Narita, S. Yasui, K. Yata, Eficient counterfactual regimes, Journal of the American Statistical
Assolearning from bandit feedback, in: Proceedings ciation 96 (2001) 1410–1423.
of the AAAI Conference on Artificial Intelligence, [19] T. Mandel, Y.-E. Liu, S. Levine, E. Brunskill,
volume 33, 2019, pp. 4634–4641. Z. Popovic, Ofline policy evaluation across
repre[6] A. Strehl, J. Langford, L. Li, S. M. Kakade, Learning sentations with applications to educational games,
from Logged Implicit Exploration Data, in: Ad- in: Proceedings of the 2014 international
confervances in Neural Information Processing Systems, ence on Autonomous agents and multi-agent
sys2010, pp. 2217–2225. tems, 2014, pp. 1077–1084.
[7] A. Swaminathan, T. Joachims, Batch Learning from [20] C. Voloshin, H. M. Le, N. Jiang, Y. Yue, Empirical
Logged Bandit Feedback through Counterfactual study of of-policy policy evaluation for
reinforceRisk Minimization, Journal of Machine Learning ment learning, arXiv preprint arXiv:1911.06854
Research 16 (2015) 1731–1755. (2019).
[8] A. Swaminathan, T. Joachims, The Self-normalized [21] M. Dudík, D. Erhan, J. Langford, L. Li, Doubly
RoEstimator for Counterfactual Learning, in: Ad- bust Policy Evaluation and Optimization, Statistical
vances in Neural Information Processing Systems, Science 29 (2014) 485–511.</p>
        <p>2015, pp. 3231–3239. [22] M. Farajtabar, Y. Chow, M. Ghavamzadeh, More
[9] A. Swaminathan, A. Krishnamurthy, A. Agarwal, robust doubly robust of-policy evaluation, in:
ProM. Dudik, J. Langford, D. Jose, I. Zitouni, Of-policy ceedings of the 35th International Conference on
Evaluation for Slate Recommendation, in: Ad- Machine Learning, 2018, pp. 1447–1456.
vances in Neural Information Processing Systems, [23] N. Vlassis, A. Bibaut, M. Dimakopoulou, T. Jebara,
2017, pp. 3635–3645. On the design of estimators for bandit of-policy
[10] Y.-X. Wang, A. Agarwal, M. Dudik, Optimal and evaluation, in: International Conference on
MaAdaptive Of-policy Evaluation in Contextual Ban- chine Learning, 2019, pp. 6468–6476.
dits, in: Proceedings of the 34th International Con- [24] A. Gilotte, C. Calauzènes, T. Nedelec, A. Abraham,
ference on Machine Learning, 2017, pp. 3589–3597. S. Dollé, Ofline a/b testing for recommender
sys[11] N. Jiang, L. Li, Doubly robust of-policy value eval- tems, in: Proceedings of the Eleventh ACM
Internauation for reinforcement learning, in: Proceedings tional Conference on Web Search and Data Mining,
of the 33rd International Conference on Machine 2018, pp. 198–206.</p>
        <p>Learning, 2016, pp. 652–661. [25] A. Gruson, P. Chandar, C. Charbuillet, J. McInerney,
[12] Y. Liu, O. Gottesman, A. Raghu, M. Komorowski, S. Hansen, D. Tardieu, B. Carterette, Ofline
evaluA. A. Faisal, F. Doshi-Velez, E. Brunskill, Representa- ation to make decisions about playlist
recommention balancing mdps for of-policy policy evaluation, dation algorithms, in: Proceedings of the Twelfth
in: Advances in Neural Information Processing Sys- ACM International Conference on Web Search and
tems, 2018, pp. 2644–2653. Data Mining, 2019, pp. 420–428.
[13] Y. Narita, S. Yasui, K. Yata, Of-policy ban- [26] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L.
Feidit and reinforcement learning, arXiv preprint Fei, Imagenet: A large-scale hierarchical image
arXiv:2002.08536 (2020). database, in: 2009 IEEE conference on computer
[14] P. Thomas, E. Brunskill, Data-eficient Of-policy vision and pattern recognition, Ieee, 2009, pp. 248–
Policy Evaluation for Reinforcement Learning, in: 255.</p>
        <p>Proceedings of the 33rd International Conference [27] V. P. Dwivedi, C. K. Joshi, T. Laurent, Y. Bengio,
on Machine Learning, 2016, pp. 2139–2148. X. Bresson, Benchmarking graph neural networks,
[15] P. Thomas, G. Theocharous, M. Ghavamzadeh, arXiv preprint arXiv:2003.00982 (2020).</p>
        <p>High confidence policy improvement, in: Proceed- [28] W. Hu, M. Fey, M. Zitnik, Y. Dong, H. Ren, B. Liu,</p>
      </sec>
      <sec id="sec-4-3">
        <title>M. Catasta, J. Leskovec, Open graph benchmark:</title>
        <p>Datasets for machine learning on graphs, arXiv
preprint arXiv:2005.00687 (2020).
[29] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich
feature hierarchies for accurate object detection
and semantic segmentation, in: Proceedings of the
IEEE conference on computer vision and pattern
recognition, 2014, pp. 580–587.
[30] K. He, X. Zhang, S. Ren, J. Sun, Deep residual
learning for image recognition, in: Proceedings of the
IEEE conference on computer vision and pattern
recognition, 2016, pp. 770–778.
[31] J. Long, E. Shelhamer, T. Darrell, Fully
convolutional networks for semantic segmentation, in:
Proceedings of the IEEE conference on computer vision
and pattern recognition, 2015, pp. 3431–3440.
[32] A. Beygelzimer, J. Langford, The ofset tree for
learning with partial labels, in: Proceedings of
the 15th ACM SIGKDD international conference
on Knowledge discovery and data mining, 2009, pp.
129–138.
[33] D. Precup, R. S. Sutton, S. Singh, Eligibility Traces
for Of-Policy Policy Evaluation, in: Proceedings
of the 17th International Conference on Machine
Learning, 2000, pp. 759–766.
[34] N. Kallus, M. Uehara, Intrinsically eficient, stable,
and bounded of-policy evaluation for
reinforcement learning, in: Advances in Neural Information
Processing Systems, 2019.
[35] J. D. Kang, J. L. Schafer, et al., Demystifying double
robustness: A comparison of alternative strategies
for estimating a population mean from incomplete
data, Statistical science 22 (2007) 523–539.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Bottou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Peters</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Quiñonero-Candela</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. X.</given-names>
            <surname>Charles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Chickering</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Portugaly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Simard</surname>
          </string-name>
          , E. Snelson,
          <article-title>Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>14</volume>
          (
          <year>2013</year>
          )
          <fpage>3207</fpage>
          -
          <lpage>3260</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Langford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Moon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>An Unbiased Ofline Evaluation of Contextual Bandit Algorithms with Generalized Linear Models</article-title>
          , in:
          <source>Journal of Machine Learning Research: Workshop and Conference Proceedings</source>
          <volume>26</volume>
          (
          <year>2012</year>
          ) pp.
          <fpage>19</fpage>
          -
          <lpage>36</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>