<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Large-scale Dataset for Decision Making Algorithms</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yuta Saito</string-name>
          <email>saito@hanjuku-kaso.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shunsuke Aihara</string-name>
          <email>shunsuke.aihara@zozo.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Megumi Matsutani</string-name>
          <email>megumi.matsutani@zozo.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yusuke Narita</string-name>
          <email>yusuke.narita@yale.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Causality in Search and Recommendation (CSR) and Simulation of Information Retrieval Evaluation (Sim4IR) workshops at SIGIR, 2021 Editors of the proceedings (editors): Krisztian Balog</institution>
          ,
          <addr-line>Xianjie Chen, Xu Chen, David Maxwell, Paul Thomas, Shuo Zhang, Yi Zhang, and Yongfeng Zhang. "</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Hanjuku-kaso</institution>
          ,
          <addr-line>Co., Ltd</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Yale University</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>ZOZO Technologies, Inc.</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>We build and publicize the Open Bandit Dataset and Pipeline to facilitate scalable and reproducible research on bandit algorithms. They are especially suitable for off-policy evaluation (OPE), which attempts to predict the performance of hypothetical algorithms using data generated by a different algorithm. We construct the dataset based on experiments and implementations on a large-scale fashion e-commerce platform, ZOZOTOWN. The data contain the ground-truth about the performance of several bandit policies and enable fair comparisons of different OPE estimators.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Interactive bandit and reinforcement learning systems produce log data valuable for evaluating and redesigning the systems. For example, the logs of a news recommendation system record which news article was presented and whether the user read it, giving the system designer a chance to make its recommendation more relevant. Exploiting log data is, however, more difficult than conventional supervised machine learning: the result is only observed for the action chosen by the system but not for all the other actions the system could have taken. The logs are also biased in that they over-represent the actions favored by the system.</p>
      <p>A potential solution to this problem is an A/B test that compares the performance of counterfactual systems in an online environment. However, A/B testing counterfactual systems is often difficult, since deploying a new policy is time- and money-consuming, and entails a risk of failure.</p>
      <p>This leads us to the problem of off-policy evaluation (OPE), which aims to estimate the performance of a counterfactual policy using only log data collected by a past (or behavior) policy. Such an evaluation allows us to compare the performance of candidate counterfactual policies to decide which policy should be deployed. This alternative approach thus solves the above problem with the A/B test approach. Applications range from contextual bandits [<xref ref-type="bibr" rid="ref1 ref2">1, 2, 3, 4, 5, 6, 7, 8, 9, 10</xref>] and reinforcement learning in the web industry [11, 12, 13, 14, 15, 16, 17] to other social domains such as healthcare [18] and education [19].</p>
      <p>While the research community has produced theoretical breakthroughs, the experimental evaluation of OPE remains primitive. Specifically, it lacks a public benchmark dataset for comparing the performance of different methods. Researchers often validate their methods using synthetic simulation environments [12, 20, 17]. A version of the synthetic approach is to modify multi-class classification datasets and treat supervised machine learning methods as bandit policies to evaluate off-policy estimators [21, 22, 23, 10]. An obvious problem with these studies is that there is no guarantee that their simulation environment is similar to real-world settings. To solve this issue, [24, 25, 5, 13] use proprietary real-world datasets. Since these datasets are not public, however, it remains challenging to reproduce the results and compare their methods with new ideas in a fair manner. This is in contrast to other domains of machine learning, where large-scale open datasets, such as the ImageNet dataset [26], have been pivotal in driving objective progress [27, 28, 29, 30, 31].</p>
      <p>Our goal is to implement and evaluate OPE of bandit algorithms in realistic and reproducible ways. We release the Open Bandit Dataset, a logged bandit feedback dataset collected on a large-scale fashion e-commerce platform, ZOZOTOWN.<sup>1</sup> ZOZOTOWN is the largest fashion e-commerce platform in Japan with over 3 billion USD annual Gross Merchandise Value. When the platform produced the data, it used Bernoulli Thompson Sampling (Bernoulli TS) and Random policies to recommend fashion items to users. The dataset includes an A/B test of these policies and collected over 26 million records of users&#8217; clicks and the ground-truth about the performance of Bernoulli TS and Random. To streamline and standardize the analysis of the Open Bandit Dataset, we also provide the Open Bandit Pipeline, a series of implementations of dataset preprocessing, behavior bandit policy simulators, and OPE estimators.</p>
      <p>1. https://corp.zozo.com/en/service/</p>
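      <p>To make the workflow concrete, the following is a minimal sketch of analyzing the dataset with the Open Bandit Pipeline. The class and argument names follow the publicly released obp package&#8217;s quickstart; treat the exact details as assumptions to verify against the documentation of the version you install.</p>
      <preformat>
# A hedged sketch of the Open Bandit Pipeline workflow, assuming the
# public obp package; names and defaults may differ across versions.
from obp.dataset import OpenBanditDataset
from obp.policy import BernoulliTS
from obp.ope import OffPolicyEvaluation, InverseProbabilityWeighting

# load the "all" campaign logs generated by the Random behavior policy
dataset = OpenBanditDataset(behavior_policy="random", campaign="all")
bandit_feedback = dataset.obtain_batch_bandit_feedback()

# counterfactual policy: Bernoulli TS replayed offline
evaluation_policy = BernoulliTS(
    n_actions=dataset.n_actions,
    len_list=dataset.len_list,
    is_zozotown_prior=True,
    campaign="all",
    random_state=12345,
)
action_dist = evaluation_policy.compute_batch_action_dist(
    n_rounds=bandit_feedback["n_rounds"],
)

# estimate the policy value of Bernoulli TS with the IPW estimator
ope = OffPolicyEvaluation(
    bandit_feedback=bandit_feedback,
    ope_estimators=[InverseProbabilityWeighting()],
)
print(ope.estimate_policy_values(action_dist=action_dist))
      </preformat>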
    </sec>
    <sec id="sec-2">
      <title>2. Setup</title>
      <sec id="sec-2-1">
        <title>We consider a general multi-armed contextual bandit</title>
        <p>setting. Let  = {0, ..., } be a finite set of  + 1
actions (equivalently, arms or treatments), that the decision
maker can choose from. Let  (· ) :  →
potential reward function that maps actions into rewards</p>
        <p>R denote a
or outcomes, where  () is the reward when action  is
chosen (e.g., whether a fashion item as an action results
in a click). Let  denote a context vector (e.g., the user’s
demographic profile and user-item interaction history)
counterfactual policy  which might be diferent from  :
  := E( (· ),)∼ [
∑︁  () (|)]

=0
= E( (· ),)∼ , ∼   [

=0
∑︁  ()
 (|)
 (|)
where the last equality uses the independence of  and
 (· ) conditional on  and the definition of  (·| ).</p>
        <p>We allow the counterfactual policy  to be degenerate,
i.e., it may choose a particular action with probability
1. Estimating   before implementing  in an online
environment is valuable because 
may perform poorly</p>
        <p>(1)
 . We think of ( (· ), ) as a random vector with un- value by comparing their estimated performances.
maps each context  ∈  into a distribution over ac- counterfactual policy. A widely-used method, DM [32],
historical logged bandit feedback with  rounds of obser- to predict the mean reward function. DM then uses it to
ior policy   as follows: (i) In each round  = 1, ...,  , is a good approximation to the mean reward function,
• In each round  = 1, ...,  , ((· ), ) is i.i.d. tion cannot be easily quantified from data [22].
 (, ) = E[ ()| = ].
define the mean reward function  :  ×  →
known distribution . Given a vector of ( (· ), ), we</p>
        <p>R as</p>
        <p>We call a function  :  → ∆( ) a policy, which
tions, where  (|) is the probability of taking action</p>
        <p>given a context vector . Let {(, , )}=1 be
vations.  := (0, ..., )′, where  is a binary
variable indicating whether action  is chosen in round .</p>
        <p>If  is chosen in round ,  = 1, otherwise  = 0.
 := ∑︀
=0</p>
        <p>() and  denote the reward and
the context observed in round , respectively. We assume
that a logged bandit feedback is generated by a
behav((· ), ) is i.i.d. drawn from distribution ., (ii) Given
, an action is randomly chosen based on  (·| ),
creating the action choice  and the associated reward
.</p>
        <p>drawn from distribution .</p>
        <p>associated reward .
• Given , an action is randomly chosen based on
 (·| ), creating the action choice  and the
∑︀</p>
        <p>=0
Suppose that   is fixed for all rounds, and thus  is
i.i.d. across rounds. Because ((· ), ) is i.i.d. across
rounds and  =</p>
        <p>(), each observation
(, , ) is i.i.d. across rounds. Note that  is
independent of (· ) conditional on .</p>
      </sec>
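      <p>The following is a minimal synthetic sketch of this data-generating process in Python. The softmax behavior policy, the logistic mean-reward function, and all constants are illustrative assumptions, not properties of the Open Bandit Dataset; later sketches reuse the variables defined here (log, m, d, mu, pi_b).</p>
      <preformat>
# Synthetic logged bandit feedback: i.i.d. contexts, a stochastic
# behavior policy pi_b, and binary rewards. Purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
T, m, d = 10_000, 4, 5               # rounds; actions {0, ..., m}; context dim
theta = rng.normal(size=(m + 1, d))  # hypothetical reward parameters

def mu(x):
    """Mean reward mu(a, x) for every action a, squashed into [0, 1]."""
    return 1.0 / (1.0 + np.exp(-(theta @ x)))

def pi_b(x):
    """Behavior policy pi_b(.|x): a softmax over actions (an assumption)."""
    logits = theta @ x
    p = np.exp(logits - logits.max())
    return p / p.sum()

log = []
for t in range(T):
    x_t = rng.normal(size=d)              # X_t ~ G (i.i.d.)
    p = pi_b(x_t)
    a_t = rng.choice(m + 1, p=p)          # A_t ~ pi_b(.|X_t)
    y_t = rng.binomial(1, mu(x_t)[a_t])   # observed reward Y_t = Y_t(A_t)
    log.append((x_t, a_t, y_t, p[a_t]))   # keep pi_b(A_t|X_t) for OPE
      </preformat>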
    </sec>
    <sec id="sec-3">
      <title>Of-Policy Evaluation</title>
      <sec id="sec-3-1">
        <title>3.1. Prediction Target</title>
        <sec id="sec-3-1-1">
          <title>We are interested in using the historical logged bandit</title>
          <p>data to estimate the following policy value of any given</p>
        </sec>
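        <p>As a sanity check on equation (1), the sketch below (continuing the synthetic example of Section 2) compares the two expressions by Monte Carlo. The uniform counterfactual policy pi_e is a hypothetical choice introduced here.</p>
        <preformat>
# Both sides of Eq. (1) should agree up to Monte Carlo error.
# Reuses m, d, mu, and pi_b from the Section 2 sketch.
import numpy as np

def pi_e(x):
    """A hypothetical counterfactual policy: uniform over the m + 1 actions."""
    return np.full(m + 1, 1.0 / (m + 1))

rng2 = np.random.default_rng(1)
direct, weighted = [], []
for _ in range(20_000):
    x = rng2.normal(size=d)
    direct.append(pi_e(x) @ mu(x))        # sum_a pi(a|x) E[Y(a)|x]
    p = pi_b(x)
    a = rng2.choice(m + 1, p=p)
    y = rng2.binomial(1, mu(x)[a])
    weighted.append(y * pi_e(x)[a] / p[a])  # importance-weighted reward

print(np.mean(direct), np.mean(weighted))   # approximately equal
        </preformat>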
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Benchmark Estimators</title>
        <p>There are several approaches to estimate the value of the
ifrst learns a supervised machine learning model, such as
random forest, ridge regression, and gradient boosting,
estimate the policy value as</p>
        <p>ˆ
  =</p>
        <p>1 ∑︁ ∑︁

=1 =0
 (|)ˆ(|).
where ˆ(|) is the estimated reward function. If ˆ(|)
this estimator accurately predicts the policy value of the
counterfactual policy   . If ˆ(|) fails to approximate
the mean reward function well, however, the final
estimator is no longer consistent. The model misspecification
issue is problematic because the extent of
misspecifica</p>
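        <p>A hedged sketch of DM on the synthetic log from Section 2 follows. The logistic-regression reward model is one illustrative choice among the supervised learners mentioned above; it reuses log, m, and pi_e from the earlier sketches.</p>
        <preformat>
# Direct Method: fit mu_hat on (context, action) pairs, then average
# pi_e(a|X_t) * mu_hat(a|X_t) over the log.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([x for x, a, y, pb in log])
A = np.array([a for x, a, y, pb in log])
Y = np.array([y for x, a, y, pb in log])

# one reward model over contexts concatenated with one-hot actions
XA = np.hstack([X, np.eye(m + 1)[A]])
reward_model = LogisticRegression(max_iter=1000).fit(XA, Y)

def mu_hat(x):
    """Estimated mean reward for every action at context x."""
    xa = np.hstack([np.tile(x, (m + 1, 1)), np.eye(m + 1)])
    return reward_model.predict_proba(xa)[:, 1]

# DM estimate: (1/T) sum_t sum_a pi(a|X_t) mu_hat(a|X_t)
V_dm = np.mean([pi_e(x) @ mu_hat(x) for x in X])
        </preformat>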
        <p>To alleviate the issue with DM, researchers often use another estimator called IPW [33, 6]. IPW re-weights the rewards by the ratio of the counterfactual policy and behavior policy as
\[
\hat{V}^{\pi}_{\mathrm{IPW}} = \frac{1}{T} \sum_{t=1}^{T} \sum_{a=0}^{m} A_{ta}\, Y_t\, \frac{\pi(a \mid X_t)}{\pi_b(a \mid X_t)}.
\]</p>
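        <p>On the synthetic log, IPW reduces to a one-line re-weighting, since $\pi_b(A_t \mid X_t)$ was stored at logging time. This sketch reuses log and pi_e from the sketches above.</p>
        <preformat>
# IPW: re-weight each observed reward by pi_e(A_t|X_t) / pi_b(A_t|X_t).
import numpy as np

V_ipw = np.mean([y * pi_e(x)[a] / pb for x, a, y, pb in log])
        </preformat>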
        <sec id="sec-3-2-1">
          <title>When the behavior policy is known, the IPW estimator is unbiased and consistent for the policy value. However, it can have a large variance, especially when the counterfactual policy significantly deviates from the behavior</title>
          <p>The final approach is DR [ 21], which combines the
policy.
above two estimators as</p>
          <p>ˆ
  =
 
=1 =0

1 ∑︁ ∑︁( − ˆ(|))
 (|)
 (|)
Notes: Bernoulli TS stands for Bernoulli Thompson Sampling. #Data is the total number of user impressions
observed during the 7-day experiment. #Items is the total number of items having a non-zero probability of
being recommended by each behavior policy. Average Age is the average age of users in each campaign. CTR is
the percentage of a click being observed in log data, and this is the ground-truth performance of behavior
policies in each campaign. 95% confidence interval (CI) of CTR is calculated based on a normal approximation of
Bernoulli sampling. Relative-CTR is CTR relative to that of the Random policy for the “All” campaign.
+  (|)ˆ(|).</p>
          <p>statistics of the dataset in Table 1. The data is large and
contains many millions of recommendation instances.</p>
          <p>The number of actions is also sizable, so this setting is
challenging for bandit algorithms and their OPE.</p>
        </sec>
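        <p>Finally, a sketch of DR on the same synthetic log: the importance-weighted term is applied to the residual $Y_t - \hat{\mu}(A_t \mid X_t)$, with the DM term acting as a control variate. It reuses log, pi_e, and mu_hat from the sketches above.</p>
        <preformat>
# Doubly Robust: IPW on the reward-model residual plus the DM term.
import numpy as np

V_dr = np.mean([
    (y - mu_hat(x)[a]) * pi_e(x)[a] / pb + pi_e(x) @ mu_hat(x)
    for x, a, y, pb in log
])
        </preformat>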
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusion and Future Work</title>
      <p>DR mimics IPW to use a weighted version of rewards,
but DR also uses the estimated mean reward function as
a control variate to decrease the variance. It preserves
the consistency of IPW if either the importance weight or
the mean reward estimator is accurate (a property called
double robustness). Moreover, DR is semiparametric
eficient [5] when the mean reward estimator is correctly
specified. On the other hand, when it is wrong, this
estimator can have larger asymptotic mean-squared-error
than IPW [34] and perform poorly in practice [35].</p>
      <sec id="sec-4-1">
        <title>To enable realistic and reproducible evaluation of of</title>
        <p>policy evaluation of bandit algorithms, we have
publicized the Open Bandit Dataset–a benchmark logged
bandit dataset collected on a large-scale fashion e-commerce
platform.</p>
        <p>In the near future, we plan to publicize the
perfor4. Dataset mance of the selected counterfactual policy in an online
environment. Such an evaluation will produce additional
We apply and evaluate the above methods by using real- log data generated by the contextual policy (while the
world data. Our data is logged bandit feedback data we current open dataset contains only log data generated by
call the Open Bandit Dataset.2 The dataset is provided by the old context-free policy). We aim to constantly expand
ZOZO, Inc.3, the largest Japanese fashion e-commerce and improve the Open Bandit Dataset to include more
company with a market capitalization of over 5 billion data and tasks.</p>
        <p>USD (as of May 2020). The company recently started
using context-free multi-armed bandit algorithms to
recommend fashion items to users in their large-scale fashion References
e-commerce platform called ZOZOTOWN.</p>
        <p>We collected the data in a 7-days experiment in late
November 2019 on three “campaigns,” corresponding to
“all”, “men’s”, and “women’s” items, respectively. Each
campaign randomly uses either the Random algorithm
or the Bernoulli Thompson Sampling (Bernoulli TS)
algorithm for each user impression. In the notation of our
bandit setups, action  is one of the possible fashion items,
while reward  is a click indicator. We describe some</p>
      </sec>
      <sec id="sec-4-2">
        <title>2https://research.zozo.com/data.html</title>
        <p>3https://corp.zozo.com/en/about/profile/
and Conference Proceedings, volume 26, 2012, pp. ings of the 32th International Conference on
Ma19–36. chine Learning, 2015, pp. 2380–2388.
[3] L. Li, W. Chu, J. Langford, R. E. Schapire, A [16] P. S. Thomas, G. Theocharous, M. Ghavamzadeh,
Contextual-bandit Approach to Personalized News High-Confidence Of-Policy Evaluation, AAAI
Article Recommendation, in: Proceedings of the (2015) 3000–3006.
19th International Conference on World Wide Web, [17] T. Xie, Y. Ma, Y.-X. Wang, Towards optimal
ofACM, 2010, pp. 661–670. policy evaluation for reinforcement learning with
[4] L. Li, W. Chu, J. Langford, X. Wang, Unbiased Of- marginalized importance sampling, in: Advances
lfine Evaluation of Contextual-bandit-based News in Neural Information Processing Systems, 2019, pp.
Article Recommendation Algorithms, in: Proceed- 9665–9675.
ings of the Fourth ACM International Conference [18] S. A. Murphy, M. J. van der Laan, J. M. Robins, C. P.
on Web Search and Data Mining, 2011, pp. 297–306. P. R. Group, Marginal mean models for dynamic
[5] Y. Narita, S. Yasui, K. Yata, Eficient counterfactual regimes, Journal of the American Statistical
Assolearning from bandit feedback, in: Proceedings ciation 96 (2001) 1410–1423.
of the AAAI Conference on Artificial Intelligence, [19] T. Mandel, Y.-E. Liu, S. Levine, E. Brunskill,
volume 33, 2019, pp. 4634–4641. Z. Popovic, Ofline policy evaluation across
repre[6] A. Strehl, J. Langford, L. Li, S. M. Kakade, Learning sentations with applications to educational games,
from Logged Implicit Exploration Data, in: Ad- in: Proceedings of the 2014 international
confervances in Neural Information Processing Systems, ence on Autonomous agents and multi-agent
sys2010, pp. 2217–2225. tems, 2014, pp. 1077–1084.
[7] A. Swaminathan, T. Joachims, Batch Learning from [20] C. Voloshin, H. M. Le, N. Jiang, Y. Yue, Empirical
Logged Bandit Feedback through Counterfactual study of of-policy policy evaluation for
reinforceRisk Minimization, Journal of Machine Learning ment learning, arXiv preprint arXiv:1911.06854
Research 16 (2015) 1731–1755. (2019).
[8] A. Swaminathan, T. Joachims, The Self-normalized [21] M. Dudík, D. Erhan, J. Langford, L. Li, Doubly
RoEstimator for Counterfactual Learning, in: Ad- bust Policy Evaluation and Optimization, Statistical
vances in Neural Information Processing Systems, Science 29 (2014) 485–511.</p>
        <p>2015, pp. 3231–3239. [22] M. Farajtabar, Y. Chow, M. Ghavamzadeh, More
[9] A. Swaminathan, A. Krishnamurthy, A. Agarwal, robust doubly robust of-policy evaluation, in:
ProM. Dudik, J. Langford, D. Jose, I. Zitouni, Of-policy ceedings of the 35th International Conference on
Evaluation for Slate Recommendation, in: Ad- Machine Learning, 2018, pp. 1447–1456.
vances in Neural Information Processing Systems, [23] N. Vlassis, A. Bibaut, M. Dimakopoulou, T. Jebara,
2017, pp. 3635–3645. On the design of estimators for bandit of-policy
[10] Y.-X. Wang, A. Agarwal, M. Dudik, Optimal and evaluation, in: International Conference on
MaAdaptive Of-policy Evaluation in Contextual Ban- chine Learning, 2019, pp. 6468–6476.
dits, in: Proceedings of the 34th International Con- [24] A. Gilotte, C. Calauzènes, T. Nedelec, A. Abraham,
ference on Machine Learning, 2017, pp. 3589–3597. S. Dollé, Ofline a/b testing for recommender
sys[11] N. Jiang, L. Li, Doubly robust of-policy value eval- tems, in: Proceedings of the Eleventh ACM
Internauation for reinforcement learning, in: Proceedings tional Conference on Web Search and Data Mining,
of the 33rd International Conference on Machine 2018, pp. 198–206.</p>
        <p>Learning, 2016, pp. 652–661. [25] A. Gruson, P. Chandar, C. Charbuillet, J. McInerney,
[12] Y. Liu, O. Gottesman, A. Raghu, M. Komorowski, S. Hansen, D. Tardieu, B. Carterette, Ofline
evaluA. A. Faisal, F. Doshi-Velez, E. Brunskill, Representa- ation to make decisions about playlist
recommention balancing mdps for of-policy policy evaluation, dation algorithms, in: Proceedings of the Twelfth
in: Advances in Neural Information Processing Sys- ACM International Conference on Web Search and
tems, 2018, pp. 2644–2653. Data Mining, 2019, pp. 420–428.
[13] Y. Narita, S. Yasui, K. Yata, Of-policy ban- [26] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L.
Feidit and reinforcement learning, arXiv preprint Fei, Imagenet: A large-scale hierarchical image
arXiv:2002.08536 (2020). database, in: 2009 IEEE conference on computer
[14] P. Thomas, E. Brunskill, Data-eficient Of-policy vision and pattern recognition, Ieee, 2009, pp. 248–
Policy Evaluation for Reinforcement Learning, in: 255.</p>
        <p>Proceedings of the 33rd International Conference [27] V. P. Dwivedi, C. K. Joshi, T. Laurent, Y. Bengio,
on Machine Learning, 2016, pp. 2139–2148. X. Bresson, Benchmarking graph neural networks,
[15] P. Thomas, G. Theocharous, M. Ghavamzadeh, arXiv preprint arXiv:2003.00982 (2020).</p>
        <p>High confidence policy improvement, in: Proceed- [28] W. Hu, M. Fey, M. Zitnik, Y. Dong, H. Ren, B. Liu,</p>
      </sec>
      <sec id="sec-4-3">
        <title>M. Catasta, J. Leskovec, Open graph benchmark:</title>
        <p>Datasets for machine learning on graphs, arXiv
preprint arXiv:2005.00687 (2020).
[29] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich
feature hierarchies for accurate object detection
and semantic segmentation, in: Proceedings of the
IEEE conference on computer vision and pattern
recognition, 2014, pp. 580–587.
[30] K. He, X. Zhang, S. Ren, J. Sun, Deep residual
learning for image recognition, in: Proceedings of the
IEEE conference on computer vision and pattern
recognition, 2016, pp. 770–778.
[31] J. Long, E. Shelhamer, T. Darrell, Fully
convolutional networks for semantic segmentation, in:
Proceedings of the IEEE conference on computer vision
and pattern recognition, 2015, pp. 3431–3440.
[32] A. Beygelzimer, J. Langford, The ofset tree for
learning with partial labels, in: Proceedings of
the 15th ACM SIGKDD international conference
on Knowledge discovery and data mining, 2009, pp.
129–138.
[33] D. Precup, R. S. Sutton, S. Singh, Eligibility Traces
for Of-Policy Policy Evaluation, in: Proceedings
of the 17th International Conference on Machine
Learning, 2000, pp. 759–766.
[34] N. Kallus, M. Uehara, Intrinsically eficient, stable,
and bounded of-policy evaluation for
reinforcement learning, in: Advances in Neural Information
Processing Systems, 2019.
[35] J. D. Kang, J. L. Schafer, et al., Demystifying double
robustness: A comparison of alternative strategies
for estimating a population mean from incomplete
data, Statistical science 22 (2007) 523–539.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Bottou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Peters</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Quiñonero-Candela</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. X.</given-names>
            <surname>Charles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Chickering</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Portugaly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Simard</surname>
          </string-name>
          , E. Snelson,
          <article-title>Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>14</volume>
          (
          <year>2013</year>
          )
          <fpage>3207</fpage>
          -
          <lpage>3260</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Langford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Moon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>An Unbiased Ofline Evaluation of Contextual Bandit Algorithms with Generalized Linear Models</article-title>
          , in:
          <source>Journal of Machine Learning Research: Workshop and Conference Proceedings</source>
          <volume>26</volume>
          (
          <year>2012</year>
          ) pp.
          <fpage>19</fpage>
          -
          <lpage>36</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>