
Colton Botta

Avi Segal

avise@post.bgu.ac.il

Kobi Gal

kobig@bgu.ac.il 0

Contextual Multi-Armed Bandit

Interventions

Incentives

0 University of Edinburgh

Online systems utilize user data, such as demographics, past performance, preferences and skillset, to construct an accurate model of users and maximize personalization. Some of these user features are “shallow” traits which seldom change (e.g. age, race, gender) while others are “deep” traits that are more volatile (e.g. performance, goals, interests). In this work, we explore how reasoning about this diversity of user features can enhance the performance of personalized systems. By modeling the personalization process as a Reinforcement Learning (RL) problem, we introduce Diversity Aware Bandits for Intervention Personalization (DABIP), a novel contextual multi-armed bandit algorithm that leverages the dynamics within user features to cluster users while maximizing outcomes. We demonstrate the efficacy of this approach using two real-world datasets from different domains.

1. Introduction

© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

bandit algorithm, as a baseline in two domains. Our results show that DABIP achieves a higher average reward than LOCB in each domain when predicting intervention outcomes per user.

2. Background

We give an overview of CMAB algorithms and diversity.

2.1. Contextual Multi-Armed Bandits

Contextual Multi-Armed Bandit (CMAB) is an extension of the Multi-Armed Bandit (MAB) problem where, at each timestep, the agent is presented with a list of arms (actions) and a context vector (additional data) about the environment. The agent needs to select and perform a single action. The agent then receives a reward for that arm only. Over time, the agent learns the underlying reward distribution of each arm and how that distribution is influenced by the context, and endeavors to maximize the total reward received over time [5]. One recent work introduced the Local Clustering in Bandits (LOCB) algorithm [4], which implemented a “soft” clustering approach, by which users are clustered together if their preferences are within a certain threshold of each other.

2.2. Diversity

The existence of differences between humans in a group is one notion of diversity [6], with these differences often falling into two distinct categories: surface-level differences and deep-level differences [7]. Surface-level differences include, for example, age, sex, ethnicity, and race, and are generally defined by their low dynamics and ability to be observed immediately [8]. Deep-level differences, on the other hand, may include skills, values, preferences, and desires. These are more volatile and can only be observed through prolonged interaction between people [7]. One example of the importance of this classification is highlighted by the WeNet project, which places human diversity at the center of a new machine-mediated paradigm of social interactions [9, 6].

3. DABIP

We now describe a Diversity Aware Bandit for Intervention Personalization algorithm.

3.1. Problem Definition

Let $N = \{1, \ldots, n\}$ represent a set of $n$ total users and $t = 1, \ldots, T$ represent a sequence of timesteps. At timestep $t$, a user, $i_t$, is drawn such that $i_t \in N$. Alongside $i_t$, the agent receives the context, $C_t = \{x_{1,t}, x_{2,t}, \ldots, x_{k,t}\}$, with one context vector for each of $k$ arms and each context vector having dimension $d$ such that $x_{a,t} \in \mathbb{R}^d$. The agent chooses one arm, $x_t$, to recommend to $i_t$ and receives reward $r_t$ in return. We assume that each user $i$ is associated with an unknown bandit parameter, $\theta_i$, that describes how $i$ interacts with the environment and can be thought of as a representation of how user $i$ behaves [4]. As in previous bandit settings [10, 4, 11], the goal is to minimize the total regret, given by:

$$R_T = \sum_{t=1}^{T} \left[ \theta_{i_t}^{\top} x_t^{*} - \theta_{i_t}^{\top} x_t \right], \qquad x_t^{*} = \arg\max_{x_{a,t} \in C_t} \theta_{i_t}^{\top} x_{a,t} \tag{1}$$

where, at each round, $t$, we compute the regret by taking the reward achieved from the best possible arm choice, $x_t^{*}$, and subtracting the reward achieved from the agent’s chosen arm, $x_t$. We also assume that each user, $i$, has a set of features, $F$, of length $q$ such that at any time, $t$, there exists $F_{i,t} = \{f_{i,1,t}, f_{i,2,t}, \ldots, f_{i,q,t}\}$.

3.2. DABIP Algorithm

The algorithm has three main steps: (1) calculate the underlying feature dynamics of all users over time, (2) form clusters of users with similar feature dynamics, then (3) utilize the clusters and past user performance to personalize interventions to users. The full details of the algorithm are given in Appendix A.

4. DABIP Performance in Multiple Domains

We apply the DABIP algorithm to two datasets from two different domains.

4.1. Eedi Dataset

The Eedi dataset [12] includes over 17 million interactions of students answering multiple choice questions. It provides interaction logs of the student ID, question ID, student answer (range a-d), and the correct answer (range a-d). Every question has an associated list of features including a question ID and a list of subject IDs. Every student has an associated list of features including gender, date of birth, etc.

4.2. WeNet Dataset

The WeNet dataset includes 6600 interactions of users participating in WeNet’s Ask4Help pilot [9, 13, 14]. Users participated in asking and answering questions, while receiving one of 4 different intervention messages that encourage their participation. The dataset provides interaction logs of the user ID, intervention message ID, and user activity level following the intervention. Additionally, every user has an associated list of features including location, big-5 characteristics, music and sports preferences, and past activity in the app. Finally, a binary label is computed for each intervention denoting if user activity post intervention surpassed a given threshold (median over post-intervention activities).

Figure 1: (a) Educational Dataset, (b) WeNet Dataset.

4.3. Experiments

We apply DABIP to both domains. In the educational domain, the algorithm chooses personalized mathematics questions, based upon past student performance, that are likely to be answered correctly by the student.
In the WeNet domain, the algorithm chooses, based upon users’ past behaviour, personalized interventions that are likely to increase users’ future engagement beyond a median-based threshold. We compared DABIP to the LOCB baseline on both datasets. LOCB is available as open source (see footnote 2); we extended and adapted it to operate on our datasets.

5. Results and Analysis

We compare the performance of DABIP and LOCB on the two datasets. As shown in Figure 1a, DABIP outperforms the LOCB baseline by about 25% on the education dataset. The DABIP-Dyn approach uses only the deep diversity features and shows comparable results to DABIP for this dataset. For the WeNet dataset, DABIP outperforms LOCB by about 30%. Additionally, DABIP demonstrates an improvement of more than 75% when compared to a random approach which chooses interventions randomly.

Our results show that identifying and extracting feature dynamics can improve RL algorithm performance, harnessing human diversity proxy information. We argue that identifying the highly dynamic features allows DABIP to search the space of context-reward associations more completely and more quickly, thus leading to better reward. This theory requires further testing, but the results of applying DABIP to real data are promising, and further research into augmenting our clustering approach is planned for the future.

2 https://github.com/banyikun/LOCB

6. Conclusion

In this work, we designed, implemented, and tested DABIP, a diversity-aware RL algorithm that uses feature dynamics as a proxy for underlying human-contextual diversity. We hypothesized that this technique could improve RL algorithms that operate in environments where user data is highly dynamic, and this proved true when applying DABIP to two different domains. We believe that extensions to DABIP can make it an ideal tool for building more performant personalized applications.

Acknowledgements

This work was supported in part by the European Union Horizon 2020 WeNet research and innovation program under grant agreement No 823783.

[16] S. Li, W. Chen, K.-S. Leung, Improved algorithm on online clustering of bandits, arXiv preprint arXiv:1902.09162 (2019).

[17] S. Li, A. Karatzoglou, C. Gentile, Collaborative filtering bandits, in: Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, 2016, pp. 539–548.

A. The DABIP Algorithm

We now give a detailed description of the algorithm. DABIP (Algorithm 1) is initialized with the number of clusters to maintain ($s$), the frequency with which to update the clusters ($u_c$), the frequency with which to update the user feature dynamics ($u_d$), and an exploration parameter ($\alpha$). Then, all users are initialized (Lines 2-4) and the algorithm begins iterating over all timesteps sequentially (Line 5). In each round, $t$, a user $i_t$ is presented along with the set of context vectors $C_t$ (Line 6). DABIP begins without any user clusters. DABIP first checks if there are any clusters (Line 7), and if there are none (length($G$) $\le 0$), then the arm with the highest upper confidence bound (UCB) is chosen. As is standard practice [10] in bandit algorithms, UCB is computed using the estimation of user $i_t$’s unknown bandit parameter, $\hat{\theta}_{i,t-1}$ (Lines 14-16), where $A^{-1}_{i,t-1}$ is the covariance matrix and $b_{i,t-1}$ is a normalizing matrix for user $i_t$ at timestep $t-1$ that are used to compute the ridge regression solution of the coefficients [10]. On the other hand, if a user clustering has been established (length($G$) $> 0$), then the cluster holding user $i_t$ is set as $g_{i,t}$ (Line 8) and DABIP calculates $\hat{\theta}_{g_{i,t}}$, which represents the unknown bandit parameter for the entire cluster (Line 9).
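As a rough illustration of the UCB step described above, the following Python sketch scores arms in the style of LinUCB [10]. The function names `ucb_scores` and `choose_arm`, the identity initialization of `A` in the usage note, and the exact form of the confidence width are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def ucb_scores(A, b, contexts, alpha):
    """Return one UCB score per arm for a single user.

    A        : (d, d) regularized design matrix (assumed to start as identity)
    b        : (d,) reward-weighted sum of past chosen contexts
    contexts : (k, d) array with one context vector per arm
    alpha    : exploration parameter scaling the confidence width
    """
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b                      # ridge regression estimate
    mean = contexts @ theta_hat                # predicted reward per arm
    # Confidence width sqrt(x^T A^{-1} x), computed for every arm at once.
    width = np.sqrt(np.einsum("kd,de,ke->k", contexts, A_inv, contexts))
    return mean + alpha * width


def choose_arm(A, b, contexts, alpha):
    """Index of the arm with the highest UCB score."""
    return int(np.argmax(ucb_scores(A, b, contexts, alpha)))
```

For example, with `A = np.eye(2)`, `b = np.array([1.0, 0.0])` and two one-hot context vectors, `choose_arm` prefers the first arm because its predicted reward is higher while both arms share the same confidence width.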

Finally, to choose an arm, we compare the UCB using the user’s unknown bandit parameter, $\hat{\theta}_{i,t-1}$, to the UCB using the average unknown bandit parameter of all users in cluster $g_{i,t}$, $\hat{\theta}_{g_{i,t}}$ (Lines 10-12). The maximum of these two UCB values is selected (Line 13). The reasoning behind this is that previous work has established that clustering users by unknown bandit parameter is an effective strategy for identifying users who behave similarly in a task, thus resulting in a collaborative filtering effect [15, 11, 16, 17, 4]. In datasets where changes in user features are not available or considered, these past works still represent the state of the art in clustering bandit algorithms. Our approach, by comparison, is to gain an advantage in datasets where user feature dynamics are available and changing. In these cases, we expect the collective bandit parameter of the cluster where user $i_t$ resides, $\hat{\theta}_{g_{i,t}}$, to estimate expected behavior better than $\hat{\theta}_{i,t-1}$.
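The comparison between the user-level and cluster-level UCB can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names `cluster_theta` and `pick_arm` are hypothetical, the cluster parameter is taken as a plain mean of per-user estimates, and ties are resolved in favor of the user-level estimate, which the text does not specify.

```python
import numpy as np


def cluster_theta(user_thetas):
    """Average the per-user parameter estimates of all users in one cluster."""
    return np.mean(np.stack(user_thetas), axis=0)


def pick_arm(user_ucb, cluster_ucb):
    """Pick the arm whose UCB is largest across both estimates.

    user_ucb    : (k,) UCB scores computed from the user's own estimate
    cluster_ucb : (k,) UCB scores computed from the cluster-level estimate
    Returns (arm index, which estimate supplied the winning score).
    """
    user_best = int(np.argmax(user_ucb))
    cluster_best = int(np.argmax(cluster_ucb))
    # Assumption: on a tie, fall back to the user's own estimate.
    if cluster_ucb[cluster_best] > user_ucb[user_best]:
        return cluster_best, "cluster"
    return user_best, "user"
```

Returning the source of the winning score alongside the arm index is a design choice for this sketch only; it makes it easy to log how often the cluster estimate overrides the individual one.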

With an arm chosen and pulled, we observe the reward, $r_t$, then update user parameters and cluster parameters for the cluster that user $i_t$ resides in (Lines 17-22). Then, any user features, $F_{i,t}$, are updated (Lines 23-24). This step will be tailored to the specific implementation and dataset, as the number, type, and sophistication of the user features will be entirely dependent on the problem definition and setup. The count for how many times user $i_t$ has been considered, $Y_i$, is also updated (Line 25). Finally, the most up-to-date clusters, $G$, are calculated and returned by the CLUSTER function (Line 26; see Algorithm 2), which ends round $t$.
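The parameter update after a pull can be illustrated with the standard ridge regression bookkeeping used by LinUCB-style bandits [10]. The class name `BanditState` and its methods are hypothetical, and the rank-one update shown here is the conventional form, assumed rather than taken from the authors' code. The same update could be applied to a cluster-level state.

```python
import numpy as np
from dataclasses import dataclass, field


@dataclass
class BanditState:
    """Per-user (or per-cluster) sufficient statistics for ridge regression."""
    d: int
    A: np.ndarray = field(init=False)
    b: np.ndarray = field(init=False)

    def __post_init__(self):
        self.A = np.eye(self.d)      # identity acts as the ridge regularizer
        self.b = np.zeros(self.d)

    def update(self, x, reward):
        """Fold one observed (context, reward) pair into the statistics."""
        self.A += np.outer(x, x)
        self.b += reward * x

    def theta_hat(self):
        """Current estimate of the bandit parameter, solving A theta = b."""
        return np.linalg.solve(self.A, self.b)
```

Using `np.linalg.solve` instead of an explicit inverse is a small numerical-stability choice and does not change the estimate.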

The second component of DABIP is clustering users based upon the similarity of their feature dynamics. The CLUSTER algorithm (Algorithm 2) assumes that each user has a set of features, $F$, of length $q$ such that at any time, $t$, there exists $F_{i,t} = \{f_{i,1,t}, f_{i,2,t}, \ldots, f_{i,q,t}\}$. The values of each individual user feature, $f_{i,j,t}$, may change over time, which can be tracked to cluster users based upon the similarity of their feature dynamics. To do this, one can observe the value of a feature at some initial timestep, then again at a later timestep, and calculate the absolute value of the difference between them. More formally, at some initial timestep, $t_0$, we store the values of all features for a given user, $i$: $F_{i,t_0}$. We also initialize a set, $Y$, that contains one value for each user, $Y_i$, counting how many times the agent has made a recommendation to user $i$. Thus, each time user $i$ is selected by the algorithm, we can update $F_{i,t}$ based upon the observed user features at timestep $t$, and increment $Y_i$ by 1. Once the agent has made a recommendation to a user $u_d$ times, say at time $t$, the feature dynamics for user $i$, $D_i$, can be computed based upon how the features have changed between $t_0$ and $t$ (Algorithm 2, Line 2). The differences are summed over time to compute $D_i$, and $u_d$ is a hyperparameter that controls how often user feature dynamics are updated. After this calculation, $F_{i,t_0}$ is set to $F_{i,t}$ and $Y_i$ is set to 0. The process repeats when $Y_i = u_d$ until all timesteps are complete.

By performing this operation for every user, we constantly have access to $D_i$, which represents the current dynamics of user $i$’s features. We use the similarity between users’ $D_i$ values to cluster them together, rather than $\hat{\theta}_i$ as done in previous works [15, 16, 4]. To that end, we maintain clusters $G = \{g_1, g_2, \ldots, g_s\}$. For simplicity, we assume that each user must appear in exactly one cluster and all users are split as evenly as possible among the $s$ clusters; see Algorithm 2 for the full clustering pseudocode. DABIP updates clusters after a period of $u_c$ timesteps have passed. This is because calculating the dynamics of the user features requires observing changes in those features over a period of time. To re-cluster after every timestep would not allow sufficient time to observe any true dynamics, so we update $D_i$ for each user after every $u_d$ timesteps in which that user is selected.

Algorithm 1 DABIP
Require: number of clusters $s$, cluster update frequency $u_c$, user feature dynamics update frequency $u_d$, exploration parameter $\alpha$
1: $G \leftarrow \emptyset$
2: for each $i \in N$ do
3:     $A_{i,0} \leftarrow I$, $b_{i,0} \leftarrow 0$
4:     $D_i \leftarrow 0$, $Y_i \leftarrow 0$
5: for $t \leftarrow 1, 2, \ldots$ do
6:     receive $i_t \in N$ and obtain $C_t = \{x_{1,t}, x_{2,t}, \ldots, x_{k,t}\}$
7:     if length of $G > 0$ then
8:         $g_{i,t} \leftarrow$ cluster where $i_t$ resides at round $t$
9:         $\hat{\theta}_{g_{i,t}} \leftarrow \frac{1}{|g_{i,t}|} \sum_{j \in g_{i,t}} A^{-1}_{j,t-1} b_{j,t-1}$
10:        $x_g \leftarrow$ arm in $C_t$ with the highest UCB under $\hat{\theta}_{g_{i,t}}$
11:        $\hat{\theta}_{i,t-1} \leftarrow A^{-1}_{i,t-1} b_{i,t-1}$
12:        $x_i \leftarrow$ arm in $C_t$ with the highest UCB under $\hat{\theta}_{i,t-1}$
13:        $x_t \leftarrow$ whichever of $x_g$, $x_i$ has the higher UCB
14:    else
15:        $\hat{\theta}_{i,t-1} \leftarrow A^{-1}_{i,t-1} b_{i,t-1}$
16:        $x_t \leftarrow$ arm in $C_t$ with the highest UCB under $\hat{\theta}_{i,t-1}$
17:    pull $x_t$ and observe reward $r_t$
18:    $A_{i,t} \leftarrow A_{i,t-1} + x_t x_t^{\top}$
19:    $b_{i,t} \leftarrow b_{i,t-1} + r_t x_t$
20:    if length of $G > 0$ then
21:        $A_{g,t} \leftarrow A_{g,t-1} + x_t x_t^{\top}$
22:        $b_{g,t} \leftarrow b_{g,t-1} + r_t x_t$
23:    for $f \in F_{i,t}$ do
24:        update $f$ according to information gathered from the problem setup
25:    $Y_i \leftarrow Y_i + 1$
26:    $G \leftarrow$ CLUSTER($u_d$, $Y$, $u_c$, $i_t$)

Algorithm 2 CLUSTER
Require: user feature dynamics update frequency $u_d$, user update counts $Y$, cluster update frequency $u_c$, user $i$
1: if $Y_i == u_d$ then
2:     $D_i = \sum_{j=1}^{q} |f_{i,j,t} - f_{i,j,t_0}|$
3:     $F_{i,t_0} \leftarrow F_{i,t}$
4:     $Y_i \leftarrow 0$
5: if $t \,\%\, u_c == 0$ then
6:     $D' \leftarrow$ sort $D$ in ascending order
7:     $G \leftarrow$ split($D'$, $s$), where split($x$, $y$) splits $x$ into length($x$) % $y$ groups each of size $\lfloor$length($x$)/$y\rfloor + 1$ and the rest of size $\lfloor$length($x$)/$y\rfloor$
8: return $G$
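The clustering step can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the function names `feature_dynamics` and `cluster_by_dynamics` are hypothetical, features are assumed to be numeric, and `numpy.array_split` is used because it produces groups whose sizes differ by at most one, which corresponds to the near-equal split described for the CLUSTER routine.

```python
import numpy as np


def feature_dynamics(initial, current):
    """Sum of absolute changes across a user's features between two snapshots."""
    initial = np.asarray(initial, dtype=float)
    current = np.asarray(current, dtype=float)
    return float(np.sum(np.abs(current - initial)))


def cluster_by_dynamics(dynamics, n_clusters):
    """Group users with similar feature dynamics.

    dynamics   : dict mapping user id -> scalar dynamics value
    n_clusters : number of clusters to form
    Users are sorted by dynamics in ascending order and cut into
    near-equal consecutive groups, so each user lands in exactly one cluster.
    """
    ordered = sorted(dynamics, key=dynamics.get)
    return [group.tolist() for group in np.array_split(ordered, n_clusters)]
```

With five users and two clusters, the three users with the smallest dynamics form the first cluster and the remaining two form the second.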

[1] Choffin, Popineau, Bourda, J.-J. Vie, Das3h: modeling student learning and forgetting for optimally scheduling distributed practice of skills, arXiv preprint arXiv:1905.06873 (2019).

[2] Nakagawa, Iwasawa, Matsuo, Graph-based knowledge tracing: modeling student proficiency using graph neural network, in: 2019 IEEE/WIC/ACM International Conference On Web Intelligence (WI), IEEE, 2019, pp. 156-163.

[3] Schelenz, I. Bison, Busso, A. De Götzen, D. Gatica-Perez, F. Giunchiglia, L. Meegahapola, S. Ruiz-Correa, The theory, practice, and ethical challenges of designing a diversity-aware platform for social relations, in: Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, 2021, pp. 905-915.

[4] Ban, He, Local clustering in contextual multi-armed bandits, in: Proceedings of the Web Conference 2021, 2021, pp. 2335-2346.

[5] R. S. Sutton, A. G. Barto, Reinforcement learning: An introduction, MIT press, 2018.

[6] Bison, Bidoglia, Busso, R. C. Abente, Cvajner, M. D. R. Britez, G. Gaskell, G. Sciortino, Stares, et al., D1.3 final model of diversity: Findings from the pre-pilots study (2021).

[7] D. A. Harrison, K. H. Price, M. P. Bell, Beyond relational demography: Time and the effects of surface- and deep-level diversity on work group cohesion, Academy of management journal 41 (1998) 96-107.

[8] S. E. Jackson, V. K. Stone, E. B. Alvarez, Socialization amidst diversity-the impact of demographics on work team oldtimers and newcomers, Research in organizational behavior 15 (1992) 45-109.

[9] Kun, A. De Götzen, M. Bidoglia, N. J. Gommesen, G. Gaskell, Exploring diversity perceptions in a community through a q&a chatbot, in: DRS2022: Bilbao, Design Research Society, 2022.

[10] Li, Chu, Langford, R. E. Schapire, A contextual-bandit approach to personalized news article recommendation, in: Proceedings of the 19th international conference on World wide web, 2010, pp. 661-670.

[11] Gentile, Li, Kar, Karatzoglou, Zappella, E. Etrue, On context-dependent clustering of bandits, in: International Conference on Machine Learning, PMLR, 2017, pp. 1253-1262.

[12] Wang, Lamb, E. Saveliev, Cameron, Zaykov, J. M. Hernández-Lobato, R. E. Turner, R. G. Baraniuk, Barton, S. P. Jones, et al., Instructions and guide for diagnostic questions: The neurips 2020 education challenge, arXiv preprint arXiv:2007.12061 (2020).

[13] A. De Götzen, P. Kun, L. Simeone, N. Morelli, 21 mediating social interaction through a chatbot to leverage the diversity of a community, Artistic Cartography and Design Explorations Towards the Pluriverse (2022) 234.

[14] Giunchiglia, I. Bison, Busso, Chenu-Abente, Rodas, Zeni, Gunel, Veltri, A. De Götzen, P. Kun, et al., A worldwide diversity pilot on daily routines and social practices (2020) (2021).

[15] Gentile, Li, G. Zappella, Online clustering of bandits, in: International Conference on Machine Learning, PMLR, 2014, pp. 757-765.