<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Prediction via Neural Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sunghyun Kim</string-name>
          <email>sunghyun@mit.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Minje Jang</string-name>
          <email>jangminje427@krafton.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Changho Suh</string-name>
          <email>chsuh@kaist.ac.kr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Recommendation in Complex Environments (ComplexRec) Joint Workshop @ RecSys 2021</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Korea Advanced Institute of Science and Technology</institution>
          ,
          <addr-line>291 Daehak-ro, Yuseong-gu, Daejeon, 34141</addr-line>
          ,
          <country country="KR">South Korea</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Krafton</institution>
          ,
          <addr-line>231 Teheran-ro, Gangnam-gu, Seoul, 06142</addr-line>
          ,
          <country country="KR">South Korea</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Massachusetts Institute of Technology</institution>
          ,
          <addr-line>77 Massachusetts Avenue, Cambridge, MA, 02139</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We consider the group match prediction problem where the goal is to estimate the probability of one group of items being preferred over another, based on partially observed group comparison data. Most prior algorithms, each tailored to a specific statistical model, suffer from inconsistent performances across different scenarios. Motivated by a key common structure of the state-of-the-art, which we call the reward-penalty structure, we develop a unified framework that achieves consistent performances across a wide range of scenarios. Our distinction lies in introducing neural networks embedded with the reward-penalty structure. Extensive experiments on synthetic and real-world datasets show that our framework consistently leads to the best performance, while the state-of-the-art perform inconsistently.</p>
      </abstract>
      <kwd-group>
        <kwd>group match prediction</kwd>
        <kwd>group recommendation</kwd>
        <kwd>neural network</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>We often compare a pair of items (or groups of items) and are interested in predicting comparison outcomes for unobserved pairs, based on partial comparison data for previously observed pairs, and in recommending select options in preference to the others. We investigate the group match prediction (group recommendation) problem where we aim to estimate the probability of one group of items being preferred over another, based on partially observed group comparison data.</p>
      <p>Most prior algorithms postulate ground-truth utilities for items, and assume underlying models to exist. These models are considered to describe the utility of a group as a function of the utilities of its items, and to govern statistical patterns of comparison data based on the group utilities. Prior algorithms have been developed for prominent yet specific statistical models, and shown to attain optimal performances (see Section 4.1).</p>
      <p>Yet, major challenges arise when we attempt to employ them in practice. They suffer from inconsistent performances. Tailored to specific models, they achieve decent performances in some scenarios that are well-represented by their models, but perform poorly in others. This inconsistency limits the application of each algorithm to a narrow range of scenarios.</p>
      <p>Main contributions. First, we identify a key common structure of the state-of-the-art, which we call the reward-penalty structure: an item is rewarded for winning (or being more preferred) in a disadvantageous match, and penalized for losing (or being less preferred) in an advantageous one. Also, the magnitudes of such rewards and penalties are proportional to its contribution to its group. It turns out that across different state-of-the-art algorithms, only the terms corresponding to rewards and penalties vary (Section 4.1 for details), but the common mechanism remains unchanged. We take it as the main structure in our framework ((5) in Section 4.2 for details).</p>
      <p>Neural networks: Motivated by the above observations, we incorporate three separate neural network modules into our framework (Sections 4.2 and 4.3 for architectural details): two modules represent reward and penalty mechanisms, and the other combines them to produce final comparison outcome predictions. These modules are trained according to the given dataset so that they can adapt to any hidden models that underlie the dataset, making our framework attain universal applicability. The overall architecture that puts all three modules together is specifically designed to maintain the reward-penalty structure, in order to retain high performances across various scenarios.</p>
      <p>Scalability: A prior work employed a single-layer neural network as an initial effort to predict winning probabilities of group matches [<xref ref-type="bibr" rid="ref7">7</xref>]. This attempt showed a promising result, demonstrating improved prediction accuracy on a real-world online game dataset. However, it exhibits scalability issues. It requires one input node per item, thus is too prohibitive to be extended for scenarios with large numbers of items. In contrast, we design our neural network modules carefully to avoid such scalability issues (Figures 1 and 2: the input dimensions are fixed regardless of the number of items). Our framework also provides higher performance compared to the prior work.</p>
      <p>Extensive experiments: We compare our framework with the single-layer baseline and other algorithms by extensive experiments on synthetic and real-world datasets. Using synthetic datasets (Section 5.1), we show that our framework achieves the best performance across four prominent models in the literature, while the other algorithms suffer from inconsistent performances across different models. Three models consider extensions of the Bradley-Terry-Luce model [<xref ref-type="bibr" rid="ref8">8</xref>] to the group comparison scenario. The other considers a generalized version of the Thurstone model [<xref ref-type="bibr" rid="ref4">4</xref>] widely used in skill rating systems of online games. Using real-world datasets (Section 5.2), we show that our framework performs consistently well across diverse real-world scenarios. We consider five real-world datasets (sources in Footnote 7). One is a crowd-sourced image classification dataset, another is a collection of movie ratings, and the other three are online game match records. In our performance evaluations, we use the cross entropy loss (see (1)) and the prediction accuracy (see (8)).</p>
      <p>Ablation studies: To verify that the reward-penalty structure in our framework plays a key role in bringing performance improvements, we conduct ablation studies (Table 2 in Section 5.2). As our baseline, we consider our framework where the reward and penalty modules are arbitrarily made defunct (Section 5.2 for details). Evaluating the performances of our framework and the baseline on all five real-world datasets in terms of both cross entropy loss and prediction accuracy, we demonstrate that the reward and penalty modules consistently improve performances across all datasets and metrics, confirming their effectiveness.</p>
    </sec>
    <sec id="sec-related">
      <title>2. Related Work</title>
      <p>The most relevant prior works are [<xref ref-type="bibr" rid="ref1 ref2 ref3 ref4">1, 2, 3, 4</xref>]. These works assume prominent models (some long-established and some widely used in practice) for in-group interactions and group comparisons, and carry out statistical analysis to a great extent. We compare our framework with the algorithms developed therein.</p>
      <p>The problem of estimating individual item utilities from group comparison data has been investigated in [<xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>]. They considered extensions of the BTL model where the group utility is either the sum or the product of its items’ utilities, and group comparisons follow the BTL model. We call the two models the BTL-sum and BTL-product models respectively.</p>
      <p>A more advanced in-group interaction model has been developed in [<xref ref-type="bibr" rid="ref3">3</xref>]. They considered a scenario where a pair of individual items in a group leads to a synergy. The group utility is represented as the sum of two quantities, which we call the HOI model: (1) the sum of individual item utilities and (2) the sum of the products of all pairs of individual item utilities. The work of [<xref ref-type="bibr" rid="ref9">9</xref>] has considered a scenario where any tuple of items leads to a synergy. In [<xref ref-type="bibr" rid="ref4">4</xref>], they assumed individual item utilities to be centered around a mean following a Gaussian distribution and viewed the group utility as their sum, which we call the Thurstone model. Their algorithm is widely used in skill rating systems of online games where groups of users compete.</p>
      <p>The work of [<xref ref-type="bibr" rid="ref7">7</xref>] has made an initial effort in employing a neural network to predict winning probabilities of group matches. It has been shown that a single-layer neural network can fit some variants of the BTL model [<xref ref-type="bibr" rid="ref2">2</xref>] and improve prediction accuracy through experiments on a real-world online game dataset. As noted, one clear distinction compared to our work lies in scalability. The neural network developed in the prior work requires one input node per item, thus is prohibitive for scenarios with large numbers of items. In contrast, we develop a neural network that requires only a fixed number of input nodes regardless of the number of items (Section 4.2 for details).</p>
    </sec>
    <sec id="sec-2">
      <title>3. Problem Setup</title>
      <p>The goal is to predict comparison outcome likelihoods for unobserved pairs of groups, given partial group comparison data. Each data point consists of (A, B, y_AB). A and B are groups of M individual items. y_AB indicates which group is preferred: y_AB = 1(A ≻ B), where A ≻ B indicates A preferred over B. We denote the set of observed comparisons by D_obs and that of unobserved comparisons by D_unobs. Our training dataset available is {(A, B, y_AB)}_(A,B)∈D_obs. We use the cross entropy loss as our metric. Let us define ŷ_AB as the estimate of Pr[y_AB = 1] produced by our algorithmic framework. Then formally, our goal is to minimize:</p>
      <p>−(1/|D_obs|) ∑_(A,B)∈D_obs y_AB log ŷ_AB + (1 − y_AB) log(1 − ŷ_AB). (1)</p>
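      <p>The loss in (1) can be computed in a few lines; the following is a minimal sketch in plain Python, with illustrative sample and estimate containers that are not part of the paper.</p>

```python
import math

def cross_entropy_loss(samples, y_hat):
    # samples: list of (A, B, y) with y = 1 when A is preferred over B.
    # y_hat: dict mapping (A, B) -> estimated Pr[y = 1].
    total = 0.0
    for A, B, y in samples:
        p = y_hat[(A, B)]
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(samples)

# Toy example: two observed comparisons, both predicted correctly
# with confidence 0.8, so the average loss is -log(0.8).
obs = [(("a", "b"), ("c", "d"), 1), (("c", "d"), ("e", "f"), 0)]
est = {(("a", "b"), ("c", "d")): 0.8, (("c", "d"), ("e", "f")): 0.2}
loss = cross_entropy_loss(obs, est)  # -log(0.8), about 0.223
```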
      <p>Notation. [n] = {1, 2, … , n} is the set of all items. We use lowercase letters (e.g., i) for individual items, and uppercase letters (e.g., A) for sets of items. {x_i}_i∈A is the set of x_i’s for all i ∈ A. [x_i]_i∈A is a vector of x_i’s for all i ∈ A and its ordering is provided in the context. x̂ is an estimate of x. We use subscripts as in x_AB when x concerns a comparison between A and B. We use superscripts as in x^(ℓ) when x is updated iteratively. We use bold symbols as in w to indicate the vector of w_i’s for all i ∈ [n].</p>
    </sec>
    <sec id="sec-3">
      <title>4. Proposed Algorithm</title>
      <sec id="sec-3-1">
        <title>4.1. Motivation</title>
        <p>Our framework design is inspired by optimal state-of-the-art algorithms (achieving minimal sample complexity or global minima of cross entropy loss) in extensions of the Bradley-Terry-Luce (BTL) model [<xref ref-type="bibr" rid="ref8">8</xref>]. We make the following key observations regarding the structural similarities shared by the state-of-the-art:
(a) They exhibit “reward” and “penalty” terms in the estimation process (details in (2) and (3)). In updating an item’s utility estimate, they reward the item for contributing to its group’s winning, and penalize it for contributing to its group’s losing.
(b) The expressions of these reward/penalty terms vary as the underlying models change.
(c) The magnitudes of rewards and penalties depend on the power dynamics between groups. A greater reward is given to an item when its group is relatively weaker compared to the opponent group, and likewise a greater penalty is given when its group is relatively stronger.
(d) Their magnitudes depend on the portion of an item’s contribution (or blame) within its group. Suppose an item’s group wins (or loses) in a comparison. The item is given a greater reward (or penalty) when its share of contribution (or blame) within the group is relatively greater.</p>
        <p>Reward-Penalty Structure. Let us examine the algorithms developed in extensions of the BTL model.
• Rank Centrality [<xref ref-type="bibr" rid="ref5">5</xref>] has been developed under the standard BTL model where individual items are compared in pairs. It achieves the minimal sample complexity for top-K rank aggregation, whose task is to estimate the set of top-K items, in certain regimes [<xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>]. As in Section 3, we define y_ij = 1(i ≻ j), given a pair of items i and j. By some manipulation, the update rule at step ℓ for individual item utility w_i^(ℓ) of item i can be shown as¹:</p>
        <p>w_i^(ℓ+1) ← w_i^(ℓ) + η ∑_j:(i,j)∈D_obs (y_ij w_j^(ℓ) − (1 − y_ij) w_i^(ℓ)). (2)</p>
        <p>¹As in [<xref ref-type="bibr" rid="ref5">5</xref>], η = 1/max_i d_i where d_i is the number of distinct items to which item i is compared. Also, we describe Rank Centrality as an iterative algorithm (one way to obtain the stationary distribution of the empirical pairwise preference matrix) to highlight its inherent reward-and-penalty structure.</p>
        <p>For item i, one can interpret w_j^(ℓ) as the reward since it increases w_i^(ℓ+1) when i ≻ j (y_ij = 1), and w_i^(ℓ) (next to (1 − y_ij)) as the penalty since it decreases w_i^(ℓ+1) when i ≺ j (y_ij = 0). η is a step size in the update.</p>
        <p>Note that the reward is large when the opponent’s utility estimate is large; we give a large reward for a win against a strong opponent (observation (c) above).</p>
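        <p>The reward-penalty form of (2) can be sketched as follows; a toy illustration assuming a dict of utilities and a list of pairwise outcomes, not the paper’s implementation (which operates on the empirical pairwise preference matrix).</p>

```python
def rank_centrality_step(w, comparisons, eta):
    # w: dict item -> utility estimate w_i at the current step.
    # comparisons: list of (i, j, y_ij) with y_ij = 1 when i beats j.
    # Reward: the opponent's utility when the item wins;
    # penalty: the item's own utility when it loses, as in Eq. (2).
    new_w = dict(w)
    for i, j, y in comparisons:
        new_w[i] += eta * (y * w[j] - (1 - y) * w[i])
        new_w[j] += eta * ((1 - y) * w[i] - y * w[j])
    return new_w

w = {"x": 0.5, "y": 0.5}
w = rank_centrality_step(w, [("x", "y", 1)], eta=0.1)
# "x" won: its estimate rises by eta * w_y; "y" falls by eta * w_x.
```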
        <p>Likewise, the penalty is large when its own utility estimate is large; we give a large penalty for a loss against a weak opponent (observation (c) above).
• Majorization-Minimization (MM) for the BTL-sum model has been developed in [<xref ref-type="bibr" rid="ref1 ref6">6, 1</xref>]. We define S_A^(ℓ) := ∑_i∈A w_i^(ℓ). By some manipulation, the update rule at step ℓ for individual item utility w_i^(ℓ) of item i can be shown as in (3), with step size η_i^(ℓ) = (∑_(A,B)∈D_obs:i∈A∪B (S_A^(ℓ) + S_B^(ℓ))^(−1))^(−1). The reward is large when the opponent group’s utility estimate is large (see S_B^(ℓ) in the numerator of the reward term R(i, ℓ); observation (c) above). The same holds for the penalty. Note also that the larger the contribution of item i within its own group (see w_i^(ℓ)/S_A^(ℓ) of R(i, ℓ)), the greater the reward (observation (d) above).
• MM for the BTL-product model has been developed in [<xref ref-type="bibr" rid="ref2">2</xref>] and shown to achieve global minima in terms of cross entropy loss. Its individual item utility update rule is described as in (3) but the reward and penalty terms are different.³ A similar interpretation applies, and all observations (a)–(d) can be found.</p>
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Modules R and P</title>
        <p>We introduce two separate modules to represent rewards and penalties respectively. We employ neural networks for the modules so they can adapt to the underlying models for any given scenario and dataset.</p>
        <p>Structure: Figure 1 depicts the detailed architecture of modules R and P. They take as input the concatenation of [w_i^(ℓ)]_i∈A and [w_j^(ℓ)]_j∈B, and R (P respectively) produces as output the current reward (penalty respectively) estimates for the items. All layers are fully connected.</p>
        <p>Operation: Recall that our comparison data samples do not include utility estimate information. Each sample only specifies the pair of groups compared (A, B) and the comparison outcome y_AB. Thus, we begin with a randomly initialized utility estimate vector w^(0) ∈ ℝⁿ (initialization details in Section 4.4). Starting from w^(0), we obtain w^(L) ∈ ℝⁿ by applying modules R and P repeatedly L times. At step ℓ, given w^(ℓ), R and P produce positive real values in [0, 1]. They are reward and penalty values respectively, and are used to obtain w^(ℓ+1) as per (5). Figure 3 illustrates the process. See the first stage therein.</p>
        <p>Scalability: The input and output dimensions (2M) are independent of the total number of items (n). Thus, our framework does not suffer from scalability issues, in contrast to a prior neural network based approach [<xref ref-type="bibr" rid="ref7">7</xref>] in which the input dimension of the neural network scales in proportion to n. This scalability issue for the prior approach manifests in our real-world datasets with large numbers of items. See our real-world data experiment results for the GIFGIF and IMDb 5000 datasets in Table 1 in Section 5.2.</p>
        <p>Refinement: We apply R and P multiple times (L &gt; 1) to obtain a final utility estimate vector w^(L). The best value of L is set via hyper-parameter tuning. This series of applications is to “refine” utility estimates.</p>
        <p>Data augmentation: R and P take as input a concatenation of two vectors [w_i^(ℓ)]_i∈A and [w_j^(ℓ)]_j∈B. As they are vectors, not sets, ordering matters. We apply data augmentation to make our framework robust against arbitrary item orderings within a group. Specifically, given a sample (A = (1, 2), B = (3, 4), y_AB = 1), we create extra samples such as (A′ = (2, 1), B′ = (4, 3), y_A′B′ = 1). We also seek robustness against arbitrary orderings of the groups. Specifically, we create extra samples by changing the order of the two sets A and B as in (A = (3, 4), B = (1, 2), y_AB = 0) and A′ and B′ as in (A′ = (4, 3), B′ = (2, 1), y_A′B′ = 0). These techniques help train R and P so as to be robust against item orderings within a group and group orderings.</p>
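        <p>The two augmentation techniques can be sketched as follows; a toy illustration in plain Python, with tuples standing in for ordered groups.</p>

```python
from itertools import permutations

def augment(A, B, y):
    # Within-group reorderings keep the label; swapping the two
    # groups flips it. Returns the set of augmented samples.
    out = set()
    for pa in permutations(A):
        for pb in permutations(B):
            out.add((pa, pb, y))        # item orderings within groups
            out.add((pb, pa, 1 - y))    # group order swapped
    return out

aug = augment((1, 2), (3, 4), 1)
# e.g. ((2, 1), (4, 3), 1) and ((4, 3), (2, 1), 0) are both included.
```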
        <p>Ablation study: R and P are the most distinctive features of our framework. To evaluate their effectiveness, we conduct ablation studies. See Table 2 in Section 5.2.</p>
        <p>Given the refined utility estimates, module G produces the winning probability estimate of A preferred over B:</p>
        <p>ŷ_AB = G([w_i^(L)]_i∈A, [w_j^(L)]_j∈B). (7)</p>
        <p>As in (6), module G takes as input a concatenation of two vectors [w_i^(L)]_i∈A and [w_j^(L)]_j∈B. The item and group orderings for the input to G are the same as those for the input to R and P. As mentioned, the item utilities used as input to module G are obtained from the operation of modules R and P.</p>
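        <p>A minimal sketch of the prediction step (7): a fully connected layer with ReLU activation followed by a sigmoid output, over a fixed input dimension 2M. The width and random weights below are illustrative placeholders, not the trained module.</p>

```python
import math
import random

M = 5        # group size; the input dimension 2M is fixed,
HIDDEN = 16  # independent of the total number of items n.

random.seed(0)
W1 = [[random.gauss(0, 0.3) for _ in range(2 * M)] for _ in range(HIDDEN)]
W2 = [random.gauss(0, 0.3) for _ in range(HIDDEN)]

def g_module(w_A, w_B):
    # Concatenate the two groups' utility estimates, pass them through
    # a ReLU layer, and squash to a winning probability via a sigmoid.
    x = list(w_A) + list(w_B)
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]
    z = sum(w * hi for w, hi in zip(W2, h))
    return 1.0 / (1.0 + math.exp(-z))

p = g_module([0.9, 0.8, 0.7, 0.6, 0.5], [0.1, 0.2, 0.3, 0.4, 0.5])
# p is a winning probability estimate in (0, 1).
```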
      </sec>
      <sec id="sec-3-3">
        <title>4.3. Module G</title>
        <p>Structure: Figure 2 depicts the detailed architecture of module G. The input and output of module G are of dimension 2M and a scalar respectively. As in modules R and P, we set the value of M as per the given scenario of interest. As the dimensions are independent of the total number of items (n), the module does not suffer from scalability issues. All layers are fully connected. The activations between layers are rectified linear units [<xref ref-type="bibr" rid="ref12">12</xref>]. The final activation is the sigmoid function [<xref ref-type="bibr" rid="ref13">13</xref>].</p>
        <p>Operation: Given a group comparison (A, B), the module takes as input the utility estimates of the items in groups A and B, and produces as output the winning probability estimate of A preferred over B. The three modules interact closely to perform the task. The overall architecture that ties them all will soon be depicted at the end of this section.</p>
        <p>Training: Let D_val be a fraction (1%–2%) of D_obs and use it for validation purposes. We divide D_train into T equal-size batches. We denote by D_train^t the data in batch t ∈ [T].
(a) Initialization: Initialize w^(0) using a Gaussian distribution whose mean is zero and variance is the normalized identity matrix, and the parameters of R, P and G using the Xavier initialization [<xref ref-type="bibr" rid="ref14">14</xref>].
(b) Refinement: For all samples (A, B) ∈ D_train^t, start with {([w_i^(0)]_i∈A, [w_j^(0)]_j∈B)}, apply the R and P modules L times, and obtain {([w_i^(L)]_i∈A, [w_j^(L)]_j∈B)}.
(c) Prediction: For all samples (A, B) ∈ D_train^t, start with {([w_i^(L)]_i∈A, [w_j^(L)]_j∈B)} obtained in (b), apply the G module, and obtain {ŷ_AB}.
(d) Model parameter update: Update the parameters of R, P, and G via the Adam optimizer [<xref ref-type="bibr" rid="ref15">15</xref>] to minimize the loss. To compute the loss, replace D_obs in (1) by D_train^t. Apply weight decay regularization with a factor of 0.01. Repeat (b)–(d) for all batches t ∈ [T].</p>
        <p>The parameters are updated T times using D_train once in one epoch. We use 500 epochs. In each, we compute a validation loss using D_val. We apply early stopping and choose the parameters with the lowest loss.</p>
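        <p>The batch loop over steps (b)–(d) with epoch-level validation can be sketched as follows; the three callables are hypothetical stand-ins for the R/P refinement plus G prediction, the loss in (1), and the optimizer step, respectively.</p>

```python
import random

def train(train_data, val_data, predict, loss_fn, update_params,
          num_batches, epochs=5):
    # predict: refine with R/P and apply G to a batch -> predictions.
    # loss_fn: cross entropy over a batch, as in Eq. (1).
    # update_params: one optimizer step given the batch loss.
    best_val = float("inf")
    for _ in range(epochs):
        random.shuffle(train_data)
        size = max(1, len(train_data) // num_batches)
        for t in range(num_batches):                     # repeat (b)-(d)
            batch = train_data[t * size:(t + 1) * size]
            update_params(loss_fn(batch, predict(batch)))
        val_loss = loss_fn(val_data, predict(val_data))
        best_val = min(best_val, val_loss)               # early stopping
    return best_val

# Toy run with constant stubs; a real run would use the trained modules.
stub_predict = lambda batch: [0.5 for _ in batch]
stub_loss = lambda batch, preds: 0.25
best = train([1, 2, 3, 4], [5, 6], stub_predict, stub_loss,
             lambda loss: None, num_batches=2)
```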
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Experimental Results</title>
      <p>
        To verify the broad applicability of our framework,
we conduct extensive experiments using synthetic
and real-world datasets. We compare it with six other
algorithms: MM-sum [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], MM-prod [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], SGD-HOI [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
TrueSkill [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], Rank Centrality [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and an algorithm
based on a single-layer neural network [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. SGD-HOI
has been developed for the factorization HOI model in
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and TrueSkill for the Thurstone model in [
        <xref ref-type="bibr" rid="ref4">4</xref>
          ]. As
Rank Centrality has been developed for the case of comparing two individual items, we consider its natural extensions depending on the dataset. For example, in the sum model, we replace the summation in (2) by a summation over the observed group comparisons (A, B) ∈ D_obs with i ∈ A, using group utility estimates in place of individual ones.
      </p>
      <sec id="sec-4-1">
        <title>5.1. Synthetic Data Experiments</title>
        <p>We use four synthetic datasets: BTL-sum, BTL-product, HOI and a generalized Thurstone. We set n = 300 and M = 5. In the HOI model, we generate the ground-truth utilities and dimension-7 features using Gaussian distributions. In the others, we generate the ground-truth utilities uniformly at random. We generate 5n log n distinct paired groups and each pair is compared 10 times.⁶</p>
        <p>We split generated datasets randomly into D_obs (90%) and D_unobs (10%). All algorithms use D_obs to predict unobserved group comparisons in D_unobs. They use a fraction (1%–2%) of D_obs for validation purposes. We use L = 20 and the learning rate of 10⁻². We use four hidden layers for all modules. For R and P, each layer consists of 2⁷ nodes, and for G, each layer consists of 2⁹ nodes.</p>
      </sec>
      <sec id="sec-4-2">
        <title>5.2. Real-World Data Experiments</title>
        <p>As in Section 5.1, we split real-world datasets randomly into D_obs (90%) and D_unobs (10%), and use a fraction (1%–2%) of D_obs for validation purposes if necessary. We use five real-world datasets⁷: GIFGIF, HOTS, DOTA 2, LoL, IMDb 5000. We use L = 30, 15, 15, 20, 20 and the learning rates of 10⁻³, 10⁻³, 10⁻², 10⁻², 10⁻² respectively. We use four hidden layers for all modules. Each layer consists of 2⁷ nodes for R and P, and 2⁹ nodes for G.</p>
        <p>⁶The 5n log n is chosen in order to guarantee connectedness between any pair of nodes within a comparison (hyper-)graph where nodes represent items and a hyper-edge represents a group comparison [<xref ref-type="bibr" rid="ref16">16</xref>]. Without connectedness guarantees, some algorithms may fail to produce meaningful estimates.</p>
        <p>Let us discuss each dataset. (1) GIFGIF: A crowd-sourcing project.⁷ We use the dataset pre-processed in [<xref ref-type="bibr" rid="ref17">17</xref>]. A participant is presented with two images and asked to choose one which better describes a given emotion.⁸ This dataset belongs to a special case of our interest as individual comparisons are concerned. We consider the emotion of happiness. We have 6,120 images and 106,886 samples. (2) HOTS: A collection of HOTS match records from 10/26/17 to 11/26/17 collected by HOTS logs.⁷ Each match consists of two groups with five players each. The players choose heroes for each match out of a pool of 84. We choose high-quality matches only where all players are highly-skilled according to some available statistics. There are 26,486 records. (3) DOTA 2: A collection of DOTA 2 match records.⁷ Each match consists of two groups with five players each, and they choose heroes out of a pool of 113. There are 50,000 records. (4) LoL: A collection of LoL professional match records.⁷ Two groups with five players each compete. The players choose heroes out of a pool of 140. There are 7,610 records. (5) IMDb 5000: A collection of meta-data for 5,000 movies.⁷ Each movie has a score and is associated with keywords. To fit our purpose, we generate match records for movie pairs. We consider each movie as a group and its five keywords as its items. Given a pair, we declare a win for the one with a higher score. We have 8,021 keywords and 123,420 samples.</p>
        <p>In addition to the cross entropy loss, we also consider another metric highly relevant in practice: prediction accuracy. We declare it a win for a group if the estimate of its winning probability is above a certain threshold, which we set as 0.5 [<xref ref-type="bibr" rid="ref18">18</xref>]. Thus, the prediction accuracy is expressed as follows⁹:</p>
        <p>(1/|D_unobs|) ∑_(A,B)∈D_unobs y_AB 1≥1/2(ŷ_AB) + (1 − y_AB) 1&lt;1/2(ŷ_AB). (8)</p>
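        <p>The accuracy metric (8) can be computed as follows; a minimal sketch with illustrative containers, using the 0.5 threshold and the indicator convention of Footnote 9.</p>

```python
def prediction_accuracy(samples, y_hat):
    # samples: list of (A, B, y) over unobserved comparisons.
    # y_hat: dict mapping (A, B) -> estimated winning probability.
    # A sample counts as correct when the estimate falls on the
    # observed winner's side of the 1/2 threshold, as in Eq. (8).
    correct = 0
    for A, B, y in samples:
        p = y_hat[(A, B)]
        correct += y * (1 if p >= 0.5 else 0) + (1 - y) * (1 if p < 0.5 else 0)
    return correct / len(samples)

unobs = [("A1", "B1", 1), ("A2", "B2", 0), ("A3", "B3", 1)]
est = {("A1", "B1"): 0.7, ("A2", "B2"): 0.4, ("A3", "B3"): 0.3}
acc = prediction_accuracy(unobs, est)  # 2 of 3 predictions correct
```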
        <p>Table 1 shows our result. We boldface the best
performances and underline the second-best. The numbers
in parentheses indicate the ranks among the algorithms
being compared in a given setup of dataset and metric.</p>
      <p>Our framework consistently yields the top performances in all cases, while the others suffer from significantly inconsistent performances across datasets and/or metrics.</p>
      <p>⁷gifgif.media.mit.edu (GIFGIF); hotslogs.com/Info/API (HOTS); kaggle.com/devinanzelmo/dota-2-matches (DOTA 2); kaggle.com/chuckephron/leagueoflegends (LoL); kaggle.com/carolzhangdc/imdb-5000-movie-dataset (IMDb 5000).</p>
      <p>⁸“Neither” is also allowed, but we exclude such data.</p>
      <p>⁹We define 1≥1/2(x) as 1 if x ≥ 1/2 and as 0 otherwise. Similarly, 1&lt;1/2(x) equals 1 if x &lt; 1/2 and 0 otherwise.</p>
        <p>To verify that our design of modules R and P plays a
critical role in achieving consistently high performances
across various scenarios, we conduct ablation studies.</p>
        <p>As our baseline, we consider the case where we refine
initialized estimates  (0)using R and P once ( = 1 ), data
augmentation is not applied, and the model parameters
of R and P are not updated. Thus, module G takes as
input coarse utility estimates and produces as output
a group comparison outcome prediction. This baseline
can be viewed as a naive neural network based approach
without the key reward-penalty mechanisms.</p>
        <p>Table 2 shows our result. The best performance
improvements (boldfaced) reach up to 12.291% in cross
entropy loss and 14.764% in prediction accuracy. The
average performance improvements are 4.511% in cross
entropy loss and 7.12% in prediction accuracy. Note that
across all datasets and metrics, it is empirically
demonstrated that the reward-penalty structure by modules R
and P brings consistent improvements.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Conclusion</title>
      <p>We explore the group match prediction problem where
the goal is to predict the probability of one group
preferred over the other given an unseen pair of groups,
based on partially observed group comparison data. We
develop a unified algorithmic framework that employs
neural networks and show that it can yield consistently
best performances compared to other state-of-the-art
algorithms on multiple datasets across various domains.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was supported by the ICT R&amp;D program of
Institute of Information &amp; Communications Technology
Planning &amp; Evaluation (IITP) grant funded by the Korea
government (MSIT) (No. 2020-0-00626).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T. K.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. C.</given-names>
            <surname>Weng</surname>
          </string-name>
          ,
          <article-title>Generalized Bradley-Terry models and multi-class probability estimates</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>7</volume>
          (
          <year>2006</year>
          )
          <fpage>85</fpage>
          -
          <lpage>115</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T. K.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. C.</given-names>
            <surname>Weng</surname>
          </string-name>
          ,
          <article-title>Ranking individuals by group comparisons</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>9</volume>
          (
          <year>2008</year>
          )
          <fpage>2187</fpage>
          -
          <lpage>2216</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Fujii</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hsieh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-J.</given-names>
            <surname>Hsieh</surname>
          </string-name>
          ,
          <article-title>Learning from group comparisons: Exploiting higher order interactions</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          (
          <year>2018</year>
          )
          <fpage>4981</fpage>
          -
          <lpage>4990</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Herbrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Minka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Graepel</surname>
          </string-name>
          ,
          <article-title>TrueSkill™: A Bayesian skill rating system</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          (
          <year>2007</year>
          )
          <fpage>569</fpage>
          -
          <lpage>576</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Negahban</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Oh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <article-title>Rank centrality: Ranking from pairwise comparisons</article-title>
          ,
          <source>Operations Research</source>
          <volume>65</volume>
          (
          <year>2016</year>
          )
          <fpage>266</fpage>
          -
          <lpage>287</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Hunter</surname>
          </string-name>
          ,
          <article-title>MM algorithms for generalized Bradley-Terry models</article-title>
          ,
          <source>Annals of Statistics</source>
          (
          <year>2004</year>
          )
          <fpage>384</fpage>
          -
          <lpage>406</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Menke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. R.</given-names>
            <surname>Martinez</surname>
          </string-name>
          ,
          <article-title>A Bradley-Terry artificial neural network model for individual ratings in group competitions</article-title>
          ,
          <source>Neural Computing and Applications</source>
          <volume>17</volume>
          (
          <year>2008</year>
          )
          <fpage>175</fpage>
          -
          <lpage>186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Bradley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Terry</surname>
          </string-name>
          ,
          <article-title>Rank analysis of incomplete block designs: I. the method of paired comparisons</article-title>
          ,
          <source>Biometrika</source>
          <volume>39</volume>
          (
          <year>1952</year>
          )
          <fpage>324</fpage>
          -
          <lpage>345</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C.</given-names>
            <surname>DeLong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Pathak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Erickson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Perrino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Shim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <article-title>TeamSkill: Modeling team chemistry in online multi-player games</article-title>
          ,
          <source>Pacific-Asia Conference on Knowledge Discovery and Data Mining</source>
          (
          <year>2011</year>
          )
          <fpage>519</fpage>
          -
          <lpage>531</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Jang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Suh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Oh</surname>
          </string-name>
          ,
          <article-title>Optimal sample complexity of M-wise data for top-K ranking</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          (
          <year>2017</year>
          )
          <fpage>1686</fpage>
          -
          <lpage>1696</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Spectral method and regularized MLE are both optimal for top-K ranking</article-title>
          ,
          <source>Annals of Statistics</source>
          <volume>47</volume>
          (
          <year>2019</year>
          )
          <fpage>2204</fpage>
          -
          <lpage>2235</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>X.</given-names>
            <surname>Glorot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bordes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>Deep sparse rectifier neural networks</article-title>
          ,
          <source>Proceedings of the fourteenth international conference on artificial intelligence and statistics</source>
          (
          <year>2011</year>
          )
          <fpage>315</fpage>
          -
          <lpage>323</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>K.</given-names>
            <surname>Hornik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stinchcombe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>White</surname>
          </string-name>
          ,
          <article-title>Multilayer feedforward networks are universal approximators</article-title>
          ,
          <source>Neural Networks</source>
          <volume>2</volume>
          (
          <year>1989</year>
          )
          <fpage>359</fpage>
          -
          <lpage>366</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>X.</given-names>
            <surname>Glorot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>Understanding the difficulty of training deep feedforward neural networks</article-title>
          ,
          <source>International Conference on Artificial Intelligence and Statistics</source>
          (
          <year>2010</year>
          )
          <fpage>249</fpage>
          -
          <lpage>256</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ba</surname>
          </string-name>
          ,
          <article-title>Adam: A method for stochastic optimization</article-title>
          ,
          <source>arXiv preprint arXiv:1412.6980</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>O.</given-names>
            <surname>Cooley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Koch</surname>
          </string-name>
          ,
          <article-title>Threshold and hitting time for high-order connectedness in random hypergraphs</article-title>
          ,
          <source>Electronic Journal of Combinatorics</source>
          (
          <year>2016</year>
          )
          <fpage>2</fpage>
          -
          <lpage>48</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>L.</given-names>
            <surname>Maystre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Grossglauser</surname>
          </string-name>
          ,
          <article-title>Just sort it! A simple and effective approach to active preference learning</article-title>
          ,
          <source>International Conference on Machine Learning</source>
          (
          <year>2017</year>
          )
          <fpage>2344</fpage>
          -
          <lpage>2353</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>O.</given-names>
            <surname>Delalleau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Contal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Thibodeau-Laufer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. C.</given-names>
            <surname>Ferrari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Beyond skill rating: Advanced matchmaking in ghost recon online</article-title>
          ,
          <source>IEEE Transactions on Computational Intelligence and AI in Games</source>
          <volume>4</volume>
          (
          <year>2012</year>
          )
          <fpage>167</fpage>
          -
          <lpage>177</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>