Group Match Prediction via Neural Networks

Sunghyun Kim1, Minje Jang2 and Changho Suh3
1 Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA, 02139, USA
2 Krafton, 231 Teheran-ro, Gangnam-gu, Seoul, 06142, South Korea
3 Korea Advanced Institute of Science and Technology, 291 Daehak-ro, Yuseong-gu, Daejeon, 34141, South Korea

sunghyun@mit.edu (S. Kim); jangminje427@krafton.com (M. Jang); chsuh@kaist.ac.kr (C. Suh)

3rd Edition of Knowledge-aware and Conversational Recommender Systems (KaRS) & 5th Edition of Recommendation in Complex Environments (ComplexRec) Joint Workshop @ RecSys 2021, 27 September–1 October 2021, Amsterdam, Netherlands
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
We consider the group match prediction problem where the goal is to estimate the probability of one group of items being preferred over another, based on partially observed group comparison data. Most prior algorithms, each tailored to a specific statistical model, suffer from inconsistent performances across different scenarios. Motivated by a key common structure of the state-of-the-art algorithms, which we call the reward-penalty structure, we develop a unified framework that achieves consistently high performances across a wide range of scenarios. Our distinction lies in introducing neural networks embedded with the reward-penalty structure. Extensive experiments on synthetic and real-world datasets show that our framework consistently leads to the best performance, while the state-of-the-art algorithms perform inconsistently.

Keywords: group match prediction, group recommendation, neural network

1. Introduction

We often compare a pair of items (or groups of items) and judge which one is of higher utility. We also become interested in predicting comparison outcomes for unobserved pairs, based on partial comparison data for previously observed pairs, and in recommending select options in preference to the others. We investigate the group match prediction (group recommendation) problem where we aim to estimate the probability of one group of M items being preferred over another, based on partially observed group comparison data.

Most prior algorithms postulate ground-truth utilities for items, and assume underlying models to exist. These models are considered to describe the utility of a group as a function of the utilities of its items, and to govern statistical patterns of comparison data based on the group utilities [1, 2, 3, 4]. Some algorithms have been developed for prominent yet specific statistical models, and shown to attain optimal performances (see Section 4.1). Yet, major challenges arise when we attempt to employ them in practice.

First, prior algorithms lead to inconsistent performances. Tailored to specific models, they achieve decent performances in some scenarios that are well represented by their models, but perform poorly in others. This inconsistency limits the application of each algorithm to a narrow range of scenarios.

Second, no prior algorithm can offer a satisfactory performance when a given scenario cannot be represented by well-studied models. For instance, underlying models for some scenarios can be inherently too complex to be approximated by existing models. In such cases, all prior algorithms are ill-suited.

We address all these challenges by developing a unified algorithmic framework that can (i) infer and adapt to any underlying models for given scenarios; and (ii) achieve consistently high performances.

Main Contributions. First, we identify a key common structure (which we call the reward-penalty structure, to be detailed in Section 4.1) shared among state-of-the-art algorithms, and incorporate the structure into our framework. This emphasis on structural aspects enables our framework to attain high performances (Section 5). Second, we introduce neural networks so as to enable our framework to infer and adapt to any latent models. We design the neural networks to respect the reward-penalty structure in order to retain high performances across various scenarios (Sections 4.2 and 4.3). Third, we construct the neural networks in such a way that our framework is robust against prohibitive scalability issues. This robustness makes our framework applicable to a broader range of scenarios which involve large numbers of items.

Insight from the state of the art: Looking into state-of-the-art algorithms for the tasks of match prediction and rank aggregation (a related and long-studied task) [5, 6, 1, 2], we find that they share a key element: the so-called reward-penalty mechanism for estimating the utilities of items. The mechanism rewards an item greatly for winning (or being more preferred) in a disadvantageous comparison where its group is weaker than the counterpart. Likewise, it penalizes an item greatly for losing (or being less preferred) in an advantageous one. Also, the magnitudes of such rewards and penalties are proportional to the item's contribution to its group. It turns out that across different state-of-the-art algorithms, only the terms corresponding to rewards and penalties vary (Section 4.1 for details), but the common mechanism remains unchanged. We take it as the main structure in our framework ((5) in Section 4.2 for details).

Neural networks: Motivated by the above observations, we incorporate three separate neural network modules into our framework (Sections 4.2 and 4.3 for architectural details): two modules represent reward and penalty mechanisms, and the other combines them to produce final comparison outcome predictions. These modules are trained on the given dataset so that they can adapt to any hidden models that underlie the dataset, making our framework attain universal applicability. The overall architecture that puts all three modules together is specifically designed to maintain the reward-penalty structure, in order to retain high performances across various scenarios.

Scalability: A prior work employed a single-layer neural network as an initial effort to predict winning probabilities of group matches [7]. This attempt showed a promising result, demonstrating improved prediction accuracy on a real-world online game dataset. However, it exhibits scalability issues. It requires one input node per item, thus is too prohibitive to be extended to scenarios with large numbers of items (scaling with n). In contrast, we design our neural network modules carefully to avoid such scalability issues (Figures 1 and 2: the input dimensions are fixed to 2M regardless of n). Our framework also provides higher performances compared to the prior work.

Extensive experiments: We compare our framework with the single-layer baseline and other algorithms by extensive experiments on synthetic and real-world datasets. Using synthetic datasets (Section 5.1), we show that our framework achieves the best performance across four prominent models in the literature, while the other algorithms suffer from inconsistent performances across different models. Three of the models consider extensions of the Bradley-Terry-Luce model [8] to the group comparison scenario. The other considers a generalized version of the Thurstone model [4] widely used in skill rating systems of online games. Using real-world datasets (Section 5.2), we show that our framework performs consistently well across diverse real-world scenarios. We consider five real-world datasets (sources in Footnote 7). One is a crowd-sourced image classification dataset, another is a collection of movie ratings, and the other three are online game match records. In our performance evaluations, we use the cross entropy loss (see (1)) and the prediction accuracy (see (8)).

Ablation studies: To verify that the reward-penalty structure in our framework plays a key role in bringing performance improvements, we conduct ablation studies (Table 2 in Section 5.2). As our baseline, we consider our framework where the reward and penalty modules are arbitrarily made defunct (Section 5.2 for details). Evaluating the performances of our framework and the baseline on all five real-world datasets in terms of both cross entropy loss and prediction accuracy, we demonstrate that the reward and penalty modules consistently improve performances across all datasets and metrics, confirming their effectiveness.
2. Related Work

The most relevant prior works are [1, 2, 3, 4]. These works assume prominent models (some long-established and some widely used in practice) for in-group interactions and group comparisons, and carry out statistical analysis to a great extent. We compare our framework with the algorithms developed therein.

The problem of estimating individual item utilities from group comparison data has been investigated in [1, 2]. They considered extensions of the BTL model where the group utility is either the sum or the product of its items' utilities, and group comparisons follow the BTL model. We call the two models the BTL-sum and BTL-product models respectively.

A more advanced in-group interaction model has been developed in [3]. They considered a scenario where a pair of individual items in a group leads to a synergy. The group utility is represented as the sum of two quantities, which we call the HOI model: (1) the sum of individual item utilities and (2) the sum of the products of all pairs of individual item utilities. The work of [9] has considered a scenario where any k-tuple of items leads to a synergy.

In [4], they assumed individual item utilities to be centered around a mean following a Gaussian distribution and viewed the group utility as their sum, which we call the Thurstone model. Their algorithm is widely used in skill rating systems of online games where groups of users compete.

The work of [7] has made an initial effort in employing a neural network to predict winning probabilities of group matches. It has been shown that a single-layer neural network can fit some variants of the BTL model [2] and improve prediction accuracy through experiments on a real-world online game dataset. As noted, one clear distinction compared to our work lies in scalability. The neural network developed in the prior work requires one input node per item, thus is prohibitive for scenarios with large numbers of items. In contrast, we develop a neural network that requires only a fixed number of input nodes regardless of the number of items (Section 4.2 for details).
3. Problem Setup

The goal is to predict comparison outcome likelihoods for unobserved pairs of groups, given partial group comparison data. Each data point consists of (A, B, y_AB). A and B are groups of M individual items. y_AB indicates which group is preferred: y_AB = 𝕀(A ≻ B), where A ≻ B indicates A being preferred over B. We denote the set of observed comparisons by D_obs and that of unobserved comparisons by D_unobs. Our available training dataset is {(A, B, y_AB)}_{(A,B)∈D_obs}. We use the cross entropy loss as our metric. Let us define ŷ_AB as the estimate of Pr[y_AB = 1] produced by our algorithmic framework. Then formally, our goal is to minimize:

$$\frac{-1}{|D_{\mathrm{obs}}|} \sum_{(A,B)\in D_{\mathrm{obs}}} y_{AB}\log \hat{y}_{AB} + (1-y_{AB})\log(1-\hat{y}_{AB}). \qquad (1)$$

Notation. [n] = {1, 2, …, n} is the set of all items. We use lowercase letters (e.g., i) for individual items, and uppercase letters (e.g., A) for sets of items. {w_i}_{i∈A} is the set of w_i's for all i ∈ A. [w_i]_{i∈A} is a vector of w_i's for all i ∈ A, and its ordering is provided in the context. ŷ is an estimate of y. We use subscripts as in y_AB when y concerns a comparison between A and B. We use superscripts as in w^(ℓ) when w is updated iteratively. We use bold symbols as in w to indicate the vector of w_i's for all i ∈ [n].
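To make the objective (1) concrete, the following is a minimal Python sketch of its computation; the function name, array layout, and toy numbers are illustrative, not from the paper.

```python
import numpy as np

def cross_entropy_loss(y, y_hat, eps=1e-12):
    """Empirical cross entropy over observed comparisons, as in (1).

    y     : outcomes y_AB in {0, 1}, one entry per observed pair (A, B)
    y_hat : predicted probabilities Pr[y_AB = 1] for the same pairs
    """
    y_hat = np.clip(y_hat, eps, 1.0 - eps)  # guard against log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Toy usage with three observed comparisons:
y = np.array([1, 0, 1])
y_hat = np.array([0.8, 0.3, 0.6])
print(cross_entropy_loss(y, y_hat))  # ~0.364
```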
4. Proposed Algorithm

4.1. Motivation

Our framework design is inspired by optimal state-of-the-art algorithms (achieving minimal sample complexity or global minima of the cross entropy loss) in extensions of the Bradley-Terry-Luce (BTL) model [8]. We make the following key observations regarding the structural similarities shared by the state-of-the-art:

(a) They exhibit "reward" and "penalty" terms in the estimation process (details in (2) and (3)). In updating an item's utility estimate, they reward the item for contributing to its group's winning, and penalize it for contributing to its group's losing.

(b) The expressions of these reward/penalty terms vary as the underlying models change.

(c) The magnitudes of rewards and penalties depend on the power dynamics between groups. A greater reward is given to an item when its group is relatively weaker compared to the opponent group, and likewise a greater penalty is given when its group is relatively stronger.

(d) Their magnitudes depend on the portion of an item's contribution (or blame) within its group. Suppose an item's group wins (or loses) in a comparison. The item is given a greater reward (or penalty) when its share of contribution (or blame) within the group is relatively greater.

Reward-Penalty Structure. Let us examine the algorithms developed in extensions of the BTL model.

• Rank Centrality [5] has been developed under the standard BTL model where individual items are compared in pairs. It achieves the minimal sample complexity for top-K rank aggregation, whose task is to estimate the set of top-K items, in certain regimes [10, 11]. As in Section 3, we define y_ij = 𝕀(i ≻ j), given a pair of items i and j. By some manipulation, the update rule at step ℓ for the individual item utility w_i^(ℓ) of item i can be shown as¹:

$$w_i^{(\ell+1)} \leftarrow w_i^{(\ell)} + \alpha \sum_{j:(i,j)\in D_{\mathrm{obs}}} \left( y_{ij}\, w_j^{(\ell)} - (1-y_{ij})\, w_i^{(\ell)} \right). \qquad (2)$$

For item i, one can interpret w_j^(ℓ) as the reward since it increases w_i^(ℓ+1) when i ≻ j (y_ij = 1), and w_i^(ℓ) (next to (1 − y_ij)) as the penalty since it decreases w_i^(ℓ+1) when i ≺ j (y_ij = 0). α is a step size in the update. Note that the reward is large when the opponent's utility estimate is large; we give a large reward for a win against a strong opponent (observation (c) above). Likewise, the penalty is large when the item's own utility estimate is large; we give a large penalty for a loss against a weak opponent (observation (c) above).

• Majorization-Minimization (MM) for the BTL-sum model has been developed in [6, 1]. We define w_A^(ℓ) := Σ_{i∈A} w_i^(ℓ). By some manipulation, the update rule at step ℓ for the individual item utility w_i^(ℓ) of item i can be shown as²:

$$w_i^{(\ell+1)} \leftarrow w_i^{(\ell)} + \alpha_i^{(\ell)} \sum_{(A,B)\in D_{\mathrm{obs}}:\, i\in A} \left( y_{AB}\cdot R_{AB,i}^{(\ell)} - (1-y_{AB})\cdot P_{AB,i}^{(\ell)} \right) \qquad (3)$$

where

$$R_{AB,i}^{(\ell)} = \frac{w_B^{(\ell)}}{w_A^{(\ell)}+w_B^{(\ell)}}\cdot\frac{w_i^{(\ell)}}{w_A^{(\ell)}}, \qquad P_{AB,i}^{(\ell)} = \frac{w_A^{(\ell)}}{w_A^{(\ell)}+w_B^{(\ell)}}\cdot\frac{w_i^{(\ell)}}{w_A^{(\ell)}}. \qquad (4)$$

Note that the update rule (3) is similar to (2) of Rank Centrality, but the reward and penalty terms in (4) are different (observation (b) above). The interpretation is similar. The reward for item i is large when the opponent group's utility estimate is large (see w_B^(ℓ) in the numerator of R_AB,i^(ℓ)). Note also that the larger the contribution of item i within its own group (see w_i^(ℓ)/w_A^(ℓ) of R_AB,i^(ℓ)), the greater the reward (observation (d) above). The same holds for the penalty.

• MM for the BTL-product model has been developed in [2] and shown to achieve global minima in terms of the cross entropy loss. Its individual item utility update rule is described as in (3) but the reward and penalty terms are different³. A similar interpretation applies, and all observations (a)–(d) can be found.

We introduce two separate modules to represent rewards and penalties respectively. We employ neural networks for the modules so they can adapt to the underlying models for any given scenario and dataset.

Footnote 1: As in [5], α = 1 / max_i d_i, where d_i is the number of distinct items to which item i is compared. Also, we describe Rank Centrality as an iterative algorithm (one way to obtain the stationary distribution of the empirical pairwise preference matrix) to highlight its inherent reward-and-penalty structure.

Footnote 2: α_i^(ℓ) = (Σ_{(A,B)∈D_obs: i∈A} (w_A^(ℓ) + w_B^(ℓ))^{-1})^{-1}.

Footnote 3: α_i^(ℓ) = (Σ_{(A,B)∈D_obs: i∈A} w_A^(ℓ)/(w_A^(ℓ) + w_B^(ℓ)))^{-1}, where w_A^(ℓ) = ∏_{i∈A} w_i^(ℓ), R_AB,i^(ℓ) = w_B^(ℓ)/(w_A^(ℓ) + w_B^(ℓ)) · w_i^(ℓ), and P_AB,i^(ℓ) = w_A^(ℓ)/(w_A^(ℓ) + w_B^(ℓ)) · w_i^(ℓ).
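For concreteness, here is a small Python sketch of the Rank Centrality-style update (2); the data structures are our own illustrative choices. The MM updates (3)-(4) follow the same skeleton with the reward and penalty terms of (4) substituted.

```python
import numpy as np

def reward_penalty_step(w, comparisons, alpha):
    """One pass of the reward-penalty update (2).

    w           : current utility estimates, shape (n,)
    comparisons : list of (i, j, y_ij) with y_ij = 1 iff i was preferred;
                  assumed to contain both orientations of each observed pair
    alpha       : step size, e.g. 1 / max_i d_i as in Footnote 1
    """
    w_next = w.copy()
    for i, j, y_ij in comparisons:
        # reward w[j] for i's win over j; penalty w[i] for i's loss
        w_next[i] += alpha * (y_ij * w[j] - (1 - y_ij) * w[i])
    return w_next
```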
4.2. Modules R and P

Structure: Figure 1 depicts the detailed architecture of the two modules which reflect the rewards (R) and penalties (P) discussed in Section 4.1. The input and output of modules R and P are of dimension 2M. M can be set arbitrarily according to the scenario of interest. R (P respectively) takes as input the current utility estimates of the individual items in a given group comparison (A, B), and produces as output the current reward (penalty respectively) estimates for the items. All layers are fully connected. The activations between layers are rectified linear units [12]. The final activation is the sigmoid function [13].

Figure 1: Architecture of module R (and P, which is omitted as it has the same structure with different weights). It takes as input utility estimates for the 2M items in a pair of groups (A, B) and produces as output reward and penalty estimates for the items. [The figure shows an input layer of dimension 2M, fully connected hidden layers with ReLU activations, and a sigmoid output layer.]

Operation: Recall that our comparison data samples do not include utility estimate information. Each sample only specifies the pair of groups compared (A, B) and the comparison outcome y_AB. Thus, we begin with a randomly initialized utility estimate vector w^(0) ∈ ℝ^n (initialization details in Section 4.4). Starting from w^(0), we obtain w^(L) ∈ ℝ^n by applying modules R and P repeatedly L times to update w^(ℓ) as follows:

$$w_i^{(\ell+1)} \leftarrow w_i^{(\ell)} + \alpha \sum_{(A,B)\in D_{\mathrm{obs}}:\, i\in A} \left( y_{AB}\cdot R_{AB,i}^{(\ell)} - (1-y_{AB})\cdot P_{AB,i}^{(\ell)} \right), \qquad (5)$$

where⁴ α = c / max_i d_i and d_i = |{(A, B) ∈ D_obs : i ∈ A ∪ B}|. At step ℓ, given w^(ℓ), R and P produce positive real values in [0, 1]. They are reward and penalty values respectively, and are used to obtain w^(ℓ+1) as per (5). They are given as follows:

$$R_{AB,i}^{(\ell)} = \mathsf{R}\left(i, [w_j^{(\ell)}]_{j\in A}, [w_k^{(\ell)}]_{k\in B}\right), \qquad P_{AB,i}^{(\ell)} = \mathsf{P}\left(i, [w_j^{(\ell)}]_{j\in A}, [w_k^{(\ell)}]_{k\in B}\right). \qquad (6)$$

At the end of step ℓ, we normalize w^(ℓ+1) to be zero-mean (w_i^(ℓ+1) ← w_i^(ℓ+1) − (1/n) Σ_{j=1}^n w_j^(ℓ+1)) and unity-norm (w_i^(ℓ+1) ← w_i^(ℓ+1) / ‖w^(ℓ+1)‖₂). We repeat until we obtain w^(L). Figure 3 illustrates the process. See the first stage therein.

Scalability: The input and output dimensions (2M) are independent of the total number of items (n). Thus, our framework does not suffer from scalability issues, in contrast to a prior neural network based approach [7] in which the input dimension of the neural network scales in proportion to n. This scalability issue of the prior approach manifests in our real-world datasets with large numbers of items. See our real-world data experiment results for the GIFGIF and IMDb 5000 datasets in Table 1 in Section 5.2.

Refinement: We apply R and P multiple times (L > 1) to obtain a final utility estimate vector w^(L). The best value of L is set via hyper-parameter tuning. This series of applications serves to "refine" the utility estimates.

Data augmentation: R and P take as input a concatenation of two vectors [w_j^(ℓ)]_{j∈A} and [w_k^(ℓ)]_{k∈B}. As they are vectors, not sets, ordering matters. We apply data augmentation to make our framework robust against arbitrary orderings of the items within a group. Specifically, given a sample, we create extra samples which represent the same outcome but have different item orderings. For example, given a sample (A = (1, 2), B = (3, 4), y_AB = 1), we create extra samples such as (A′ = (2, 1), B′ = (4, 3), y_A′B′ = 1). We also seek robustness against arbitrary orderings of the groups. Specifically, we create extra samples by changing the order of the two sets A and B as in (A = (3, 4), B = (1, 2), y_AB = 0) and A′ and B′ as in (A′ = (4, 3), B′ = (2, 1), y_A′B′ = 0). These techniques help train R and P so as to be robust against item orderings within a group and group orderings.

Ablation study: R and P are the most distinctive features of our framework. To evaluate their effectiveness, we conduct ablation studies. See Table 2 in Section 5.2.

Footnote 4: Prior work (see Footnote 1) has motivated the choice of α. The numerator c is determined by hyper-parameter tuning.
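A minimal PyTorch sketch of module R follows (P is an independent copy with its own weights); the class name, layer widths, and the per-sample refinement helper are our own illustrative choices consistent with Figure 1 and (5), not the authors' implementation.

```python
import torch
import torch.nn as nn

class ModuleR(nn.Module):
    """Module R of Figure 1: fully connected layers with ReLU activations
    in between and a sigmoid output, mapping the 2M current utilities of a
    compared pair (A, B) to 2M reward values in [0, 1]."""
    def __init__(self, M, num_hidden=4, width=None):
        super().__init__()
        width = width or 7 * M          # 7M nodes per hidden layer (Section 5.1)
        dims = [2 * M] + [width] * num_hidden
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.ReLU()]
        layers += [nn.Linear(dims[-1], 2 * M), nn.Sigmoid()]
        self.net = nn.Sequential(*layers)

    def forward(self, w_pair):          # (batch, 2M) -> (batch, 2M)
        return self.net(w_pair)

def refine_step(w, samples, R, P, alpha, M):
    """One application of (5), followed by the zero-mean / unity-norm
    normalization. `samples` holds (A, B, y_AB) with A, B LongTensors of
    length M. Only items of A are updated here; the group-order-swapped
    samples produced by the data augmentation above cover the B side."""
    delta = torch.zeros_like(w)
    for A, B, y in samples:
        w_pair = torch.cat([w[A], w[B]]).unsqueeze(0)
        r = R(w_pair).squeeze(0)[:M]    # rewards for the items of A
        p = P(w_pair).squeeze(0)[:M]    # penalties for the items of A
        delta = delta.index_add(0, A, y * r - (1 - y) * p)
    w = w + alpha * delta
    w = w - w.mean()                    # zero-mean
    return w / w.norm()                 # unity-norm
```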
These techniques orderings for the input to G are the same as those for the help train R and P so as to be robust against item or- input to R and P. As mentioned, the item utilities used derings within a group and group orderings. as input to module G are obtained from the operation of Ablation study: R and P are the most distinctive fea- modules R and P. tures of our framework. To evaluate their effectiveness, 1st stage 2nd stage we conduct ablation studies. See Table 2 in Section 5.2. (β„“) (L) [wi ]i∈A [wi ]i∈A 4.3. Module G w(β„“) (β„“) R (β„“) [RAB,i ]i∈A w(β„“+1) w(L) (L) G [wj ]j∈B zero-mean [wj ]j∈B (β„“) P [PAB,j ]j∈B unity-norm (L) [wi ]i∈A Figure 3: Overall architecture of our framework. ReLU ReLU ReLU ReLU sigmoid Figure 3 illustrates the overall architecture where mod- ules R, P and G are put together. The first stage depicts (L) (t) the operation of R and P (initialization of 𝑀(0) omitted). [w j∈B input [wjj ]j∈B fully connected output layer hidden layers layer The second stage depicts the operation of G. Note that the reward-penalty structure is present. This structure is unaffected during the training process where only the Figure 2: Architecture of module G. It takes as input utility parameters inside the modules are updated. estimates for the 2𝑀 items in (𝐴, 𝐡) (obtained by modules R and P) and produces as output a group comparison prediction. 4.4. Training Procedure To perform our main task, we need more than rewards We split 𝐷obs randomly into 𝐷train and 𝐷val . We let 𝐷val and penalties. As seen in Section 4.1, latent models can be a fraction (1%–2%) of 𝐷obs and use it for validation be assumed to exist. In-group interaction models (among purposes. We divide 𝐷train into 𝐡 equal-size batches. We items within a group) lead to group utilities. Group com- 𝑏 denote by 𝐷train the data in batch 𝑏 ∈ [𝐡]. parison models (between a pair of groups) govern statisti- cal patterns of comparison outcomes. The role of module (π‘Ž) Initialization: Initialize 𝑀(0) using a Gaussian distri- G is to fit these models. Modules R and P help quantify bution whose mean is zero and variance is the nor- and provide item utility estimates. Then module G takes malized identity matrix, and the parameters of R, P as input the item utility estimates for a pair of groups, and G using the Xavier initialization [14]. and produces as output the probability of one group pre- 𝑏 (𝑏) Refinement: For all samples (𝐴, 𝐡) ∈ 𝐷train , start ferred over the other. The three modules interact closely to perform the task. The overall architecture that ties (0) (0) with {([𝑀𝑖 ]π‘–βˆˆπ΄ , [𝑀𝑗 ]π‘—βˆˆπ΅ )}(𝐴,𝐡)∈𝐷 𝑏 initialized in train them all will soon be depicted at the end of this section. (π‘Ž), apply R and P modules 𝐿 times, and obtain Structure: Figure 2 depicts the detailed architecture (𝐿) (𝐿) {([𝑀𝑖 ]π‘–βˆˆπ΄ , [𝑀𝑗 ]π‘—βˆˆπ΅ )}(𝐴,𝐡)∈𝐷 𝑏 . of module G. The input and output of module G are of train dimension 2𝑀 and a scalar respectively. As in modules R 𝑏 (𝑐) Prediction: For all samples (𝐴, 𝐡) ∈ 𝐷train , start with and P, we set the value of 𝑀 as per the given scenario of (𝐿) (𝐿) interest. As the dimensions are independent of the total {([𝑀𝑖 ]π‘–βˆˆπ΄ , [𝑀𝑗 ]π‘—βˆˆπ΅ )}(𝐴,𝐡)∈𝐷 𝑏 obtained in (𝑏), ap- train number of items (𝑛), the module does not suffer from ply G module, and obtain {𝑦𝐴𝐡 Μ‚ }(𝐴,𝐡)∈𝐷 𝑏 . train scalability issues. All layers are fully connected. The activations between layers are rectified linear units [12]. 
4.4. Training Procedure

We split D_obs randomly into D_train and D_val. We let D_val be a fraction (1%–2%) of D_obs and use it for validation purposes. We divide D_train into B equal-size batches. We denote by D_train^b the data in batch b ∈ [B].

(a) Initialization: Initialize w^(0) using a Gaussian distribution whose mean is zero and covariance is the normalized identity matrix, and the parameters of R, P and G using the Xavier initialization [14].

(b) Refinement: For all samples (A, B) ∈ D_train^b, start with {([w_i^(0)]_{i∈A}, [w_j^(0)]_{j∈B})}_{(A,B)∈D_train^b} initialized in (a), apply modules R and P L times, and obtain {([w_i^(L)]_{i∈A}, [w_j^(L)]_{j∈B})}_{(A,B)∈D_train^b}.

(c) Prediction: For all samples (A, B) ∈ D_train^b, start with {([w_i^(L)]_{i∈A}, [w_j^(L)]_{j∈B})}_{(A,B)∈D_train^b} obtained in (b), apply module G, and obtain {ŷ_AB}_{(A,B)∈D_train^b}.

(d) Model parameter update: Update the parameters of R, P, and G via the Adam optimizer [15] to minimize the loss. To compute the loss, replace D_obs in (1) by D_train^b. Apply weight decay regularization with a factor of 0.01. Repeat (b)–(d) for all batches b ∈ [B].

The parameters are updated B times, using D_train once, in one epoch. We use 500 epochs. In each epoch, we compute a validation loss using D_val. We apply early stopping and choose the parameters with the lowest validation loss.
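Putting steps (a)-(d) together, a compact training loop might look as follows, reusing ModuleR, ModuleG, refine_step and predict from the sketches above. It is a sketch under our assumptions (e.g., early stopping simplified to keeping the best validation checkpoint), not the authors' implementation.

```python
def forward_pass(R, P, G, samples, w0, M, L, alpha):
    """Steps (b)-(c): refine w^(0) L times with R and P, then predict."""
    w = w0
    for _ in range(L):                               # (b): refinement
        w = refine_step(w, samples, R, P, alpha, M)
    y_hat = torch.stack([predict(w, A, B, G) for A, B, _ in samples])
    y = torch.tensor([float(y) for _, _, y in samples])
    return y_hat, y                                  # (c): predictions

def train(R, P, G, n, train_batches, val_samples, M, L, alpha,
          epochs=500, lr=1e-2):
    w0 = torch.randn(n) / n ** 0.5                   # (a): zero mean, cov I/n
    params = [*R.parameters(), *P.parameters(), *G.parameters()]
    opt = torch.optim.Adam(params, lr=lr, weight_decay=0.01)  # (d)
    bce = torch.nn.BCELoss()                         # the loss (1)
    best_val, best_state = float("inf"), None
    for _ in range(epochs):                          # 500 epochs
        for batch in train_batches:                  # b = 1, ..., B
            y_hat, y = forward_pass(R, P, G, batch, w0, M, L, alpha)
            loss = bce(y_hat, y)                     # (1) on D_train^b
            opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():                        # validation for early stopping
            val = bce(*forward_pass(R, P, G, val_samples, w0, M, L, alpha))
        if val.item() < best_val:                    # keep best checkpoint
            best_val = val.item()
            best_state = [p.detach().clone() for p in params]
    return best_val, best_state
```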
5. Experimental Results

To verify the broad applicability of our framework, we conduct extensive experiments using synthetic and real-world datasets. We compare it with six other algorithms: MM-sum [1], MM-prod [2], SGD-HOI [3], TrueSkill [4], Rank Centrality [5], and an algorithm based on a single-layer neural network [7]. SGD-HOI has been developed for the factorization HOI model in [3] and TrueSkill for the Thurstone model in [4]. As Rank Centrality has been developed for the case of comparing two individual items, we consider its natural extensions depending on the dataset. For example, in the sum model, we replace the summation in (2) by Σ_{(A,B)∈D_obs: i∈A} (y_AB Σ_{k∈B} w_k^(ℓ) − (1 − y_AB) Σ_{k∈A} w_k^(ℓ)). The single-layer neural network algorithm developed in [7] has a scalability issue. It requires at least one input node per item, thus becomes bloated for large n. This drawback prevents us from measuring its performance in our setup⁵ for some real-world datasets such as GIFGIF and IMDb 5000 (to be presented in Table 1 in Section 5.2).

5.1. Synthetic Data Experiments

We use four synthetic datasets: BTL-sum, BTL-product, HOI and a generalized Thurstone. We set n = 300 and M = 5. In the HOI model, we generate the ground-truth utilities and dimension-7 features using Gaussian distributions. In the others, we generate the ground-truth utilities uniformly at random. We generate 5n log n distinct paired groups and each pair is compared 10 times.⁶ We split the generated datasets randomly into D_obs (90%) and D_unobs (10%). All algorithms use D_obs to predict unobserved group comparisons in D_unobs. They use a fraction (1%–2%) of D_obs for validation purposes. We use L = 20 and a learning rate of 10⁻². We use four hidden layers for all modules. For R and P, each layer consists of 7M nodes, and for G, each layer consists of 9M nodes.

Footnote 5: Intel Core i7-6850K @ 3.6GHz (CPU) and GeForce GTX 1080 Ti (single GPU).

Footnote 6: The number 5n log n is chosen in order to guarantee connectedness between any pair of nodes within a comparison (hyper-)graph where nodes represent items and a hyper-edge represents a group comparison [16]. Without connectedness guarantees, some algorithms may fail to produce meaningful estimates.
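As an illustration of this data-generation protocol, the following sketch samples a BTL-sum dataset; the function name and the uniform range for the ground-truth utilities are our assumptions, since the paper does not specify them.

```python
import numpy as np

def make_btl_sum_dataset(n=300, M=5, repeats=10, seed=0):
    """Synthetic BTL-sum data as in Section 5.1: a group's utility is the
    sum of its items' utilities, and Pr[A preferred] = w_A / (w_A + w_B)."""
    rng = np.random.default_rng(seed)
    w = rng.uniform(0.1, 1.0, size=n)        # ground-truth utilities (assumed range)
    data = []
    for _ in range(int(5 * n * np.log(n))):  # 5 n log n distinct paired groups
        items = rng.choice(n, size=2 * M, replace=False)
        A, B = items[:M], items[M:]
        p = w[A].sum() / (w[A].sum() + w[B].sum())
        for _ in range(repeats):             # each pair compared 10 times
            data.append((A, B, int(rng.random() < p)))
    return data
```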
Figure 4: Synthetic data experiments. From left to right, top to bottom: BTL-sum model, BTL-product model, HOI model, and a generalized Thurstone model. [Each panel plots the cross entropy loss (y-axis; measured on D_unobs, the 10% of the dataset held out for testing) against the amount of data used for training, in fractions of D_obs, the 90% of the dataset used for training (x-axis). Off-the-scale curves of low-performance algorithms are not shown.]

Figure 4 shows our results. The performance curves of the algorithms that underperform by large gaps are not presented. Let us discuss each sub-figure from left to right. (1) BTL-sum model: In most settings where we have sufficient data, our framework achieves the performance promised by MM-sum, which has been shown in [1] to achieve local minima in terms of the cross entropy loss. (2) BTL-product model: Our framework achieves the optimal performance promised by MM-prod, which has been shown in [2] to achieve global minima in terms of the cross entropy loss. (3) HOI model: SGD-HOI performs best in most settings. Our framework is second-best with a slight gap to the best performance in those settings, but performs best when we have sufficient data. This is because our framework, using neural networks, is affected by overfitting when data is insufficient. (4) Thurstone model: MM-prod and our framework perform best. TrueSkill comes next with a gap, but it clearly outperforms the others. Interestingly, TrueSkill, developed specifically for the Thurstone model, does not lead to the best performance. But this result does not run counter to theory, as its optimality has not been shown in the literature.

Our framework performs consistently best (or near-best with a marginal gap) across all datasets, while the others perform inconsistently across them. Some perform well in one, but poorly in the others (for example, MM-sum performs best only in the BTL-sum model but poorly in all others). It is important to note that our framework outperforms the single-layer neural network baseline developed in [7] across all datasets.

5.2. Real-World Data Experiments

As in Section 5.1, we split the real-world datasets randomly into D_obs (90%) and D_unobs (10%), and use a fraction (1%–2%) of D_obs for validation purposes if necessary. We use five real-world datasets⁷: GIFGIF, HOTS, DOTA 2, LoL, and IMDb 5000. We use L = 30, 15, 15, 20, 20 and learning rates of 10⁻³, 10⁻³, 10⁻², 10⁻², 10⁻² respectively. We use four hidden layers for all modules. Each layer consists of 7M nodes for R and P, and 9M nodes for G.

Let us discuss each dataset. (1) GIFGIF: A crowd-sourcing project.⁷ We use the dataset pre-processed in [17]. A participant is presented with two images and asked to choose the one which better describes a given emotion.⁸ This dataset belongs to a special case of our interest as individual comparisons are concerned. We consider the emotion of happiness. We have 6,120 images and 106,886 samples. (2) HOTS: A collection of HOTS match records from 10/26/17 to 11/26/17 collected by HOTS Logs.⁷ Each match consists of two groups with five players each. The players choose heroes for each match out of a pool of 84. We choose high-quality matches only, where all players are highly skilled according to some available statistics. There are 26,486 records. (3) DOTA 2: A collection of DOTA 2 match records.⁷ Each match consists of two groups with five players each, and they choose heroes out of a pool of 113. There are 50,000 records. (4) LoL: A collection of LoL professional match records.⁷ Two groups with five players each compete. The players choose heroes out of a pool of 140. There are 7,610 records. (5) IMDb 5000: A collection of meta-data for 5,000 movies.⁷ Each movie has a score and is associated with keywords. To fit our purpose, we generate match records for movie pairs. We consider each movie as a group and its five keywords as its items. Given a pair, we declare a win for the one with the higher score. We have 8,021 keywords and 123,420 samples.

In addition to the cross entropy loss, we also consider another metric highly relevant in practice: prediction accuracy. We declare a win for a group if the estimate of its winning probability is above a certain threshold, which we set as 0.5 [18]. Thus, the prediction accuracy is expressed as follows⁹:

$$\frac{1}{|D_{\mathrm{unobs}}|} \sum_{(A,B)\in D_{\mathrm{unobs}}} y_{AB}\,\mathbb{I}_{\geq \frac{1}{2}}(\hat{y}_{AB}) + (1-y_{AB})\,\mathbb{I}_{< \frac{1}{2}}(\hat{y}_{AB}). \qquad (8)$$

Footnote 7: gifgif.media.mit.edu (GIFGIF); hotslogs.com/Info/API (HOTS); kaggle.com/devinanzelmo/dota-2-matches (DOTA 2); kaggle.com/chuckephron/leagueoflegends (LoL); kaggle.com/carolzhangdc/imdb-5000-movie-dataset (IMDb 5000).

Footnote 8: "Neither" is also allowed, but we exclude such data.

Footnote 9: We define 𝕀_{≥1/2}(x) as 1 if x ≥ 1/2 and as 0 otherwise. Similarly, 𝕀_{<1/2}(x) equals 1 if x < 1/2 and 0 otherwise.
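Expressed in code, the accuracy metric (8) reduces to thresholded agreement between predictions and outcomes; a minimal sketch:

```python
import numpy as np

def prediction_accuracy(y, y_hat, threshold=0.5):
    """Prediction accuracy as in (8): predict a win for A whenever
    y_hat_AB >= 1/2 (threshold 0.5, following [18])."""
    pred = (np.asarray(y_hat) >= threshold).astype(int)
    return float(np.mean(pred == np.asarray(y)))
```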
Table 1: Match prediction results in terms of cross entropy loss (CE-Loss) and prediction accuracy (Acc) in the real-world dataset experiments (N/A: not available due to scalability issues). The numbers in parentheses indicate the ranks among the algorithms in each dataset and metric; (1) marks the best performance and (2) the second-best.

                   GIFGIF                HOTS                  DOTA 2                LoL                   IMDb 5000
                   CE-Loss    Acc        CE-Loss    Acc        CE-Loss    Acc        CE-Loss    Acc        CE-Loss     Acc
Proposed           .3053 (1)  .8729 (3)  .6790 (1)  .5673 (1)  .6599 (1)  .6090 (2)  .6936 (1)  .5411 (1)  .8107 (1)   .6016 (1)
Menke & Martinez   N/A        N/A        .6830 (6)  .5458 (7)  .6601 (2)  .6127 (1)  .6937 (2)  .5236 (5)  N/A         N/A
MM-sum             .3114 (3)  .8758 (1)  .6796 (4)  .5645 (4)  .6610 (4)  .6049 (5)  .7036 (3)  .5166 (7)  1.0719 (4)  .5812 (2)
MM-prod            .3126 (4)  .8758 (1)  .6793 (3)  .5660 (2)  .6603 (3)  .6083 (3)  .7070 (5)  .5249 (3)  3.7455 (5)  .5428 (5)
SGD-HOI            .4056 (6)  .8614 (5)  .6798 (5)  .5562 (5)  .6673 (6)  .5955 (6)  .7125 (6)  .5171 (6)  .9471 (3)   .5780 (3)
TrueSkill          .3063 (2)  .8728 (4)  .6857 (7)  .5499 (6)  .6683 (7)  .5863 (7)  .7057 (4)  .5245 (4)  .8409 (2)   .5736 (4)
Rank Centrality    .3761 (5)  .8555 (6)  .6792 (2)  .5660 (2)  .6667 (5)  .6051 (4)  .7230 (7)  .5276 (2)  3.7455 (5)  .5096 (6)

Table 1 shows our results. Our framework consistently yields the top performances in all cases, while the others suffer from significantly inconsistent performances across datasets and/or metrics.

To verify that our design of modules R and P plays a critical role in achieving consistently high performances across various scenarios, we conduct ablation studies. As our baseline, we consider the case where we refine the initialized estimates w^(0) using R and P only once (L = 1), data augmentation is not applied, and the model parameters of R and P are not updated. Thus, module G takes as input coarse utility estimates and produces as output a group comparison outcome prediction. This baseline can be viewed as a naive neural network based approach without the key reward-penalty mechanisms.

Table 2: Ablation study results that evaluate the effectiveness of modules R & P in the real-world dataset experiments. "Not Trained" indicates R & P made defunct: to produce coarse item utility estimates, they are applied once (L = 1) without refinements, and data augmentation and model updates are not applied. "Trained" indicates our complete framework. Improvements are CE-Loss decreases and accuracy increases.

                     GIFGIF            HOTS              DOTA 2            LoL               IMDb 5000
                     CE-Loss  Acc      CE-Loss  Acc      CE-Loss  Acc      CE-Loss  Acc      CE-Loss  Acc
R & P Not Trained    0.338    0.864    0.693    0.515    0.689    0.535    0.696    0.520    0.743    0.538
R & P Trained        0.296    0.875    0.682    0.560    0.660    0.614    0.690    0.539    0.716    0.576
Improvements         12.291%  1.338%   1.514%   8.736%   4.255%   14.764%  0.880%   3.704%   3.616%   7.058%

Table 2 shows our results. The best performance improvements reach up to 12.291% in cross entropy loss and 14.764% in prediction accuracy. The average performance improvements are 4.511% in cross entropy loss and 7.12% in prediction accuracy. Note that across all datasets and metrics, it is empirically demonstrated that the reward-penalty structure realized by modules R and P brings consistent improvements.

6. Conclusion

We explore the group match prediction problem where the goal is to predict the probability of one group being preferred over the other for an unseen pair of groups, based on partially observed group comparison data. We develop a unified algorithmic framework that employs neural networks and show that it yields consistently best performances compared to other state-of-the-art algorithms on multiple datasets across various domains.

Acknowledgments

This work was supported by the ICT R&D program of the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2020-0-00626).
References

[1] T. K. Huang, C. J. Lin, R. C. Weng, Generalized Bradley-Terry models and multi-class probability estimates, Journal of Machine Learning Research 7 (2006) 85–115.
[2] T. K. Huang, C. J. Lin, R. C. Weng, Ranking individuals by group comparisons, Journal of Machine Learning Research 9 (2008) 2187–2216.
[3] Y. Li, M. Cheng, K. Fujii, F. Hsieh, C.-J. Hsieh, Learning from group comparisons: Exploiting higher order interactions, Advances in Neural Information Processing Systems (2018) 4981–4990.
[4] R. Herbrich, T. Minka, T. Graepel, TrueSkill™: A Bayesian skill rating system, Advances in Neural Information Processing Systems (2007) 569–576.
[5] S. Negahban, S. Oh, D. Shah, Rank Centrality: Ranking from pairwise comparisons, Operations Research 65 (2016) 266–287.
[6] D. R. Hunter, MM algorithms for generalized Bradley-Terry models, Annals of Statistics (2004) 384–406.
[7] J. E. Menke, T. R. Martinez, A Bradley-Terry artificial neural network model for individual ratings in group competitions, Neural Computing and Applications 17 (2008) 175–186.
[8] R. A. Bradley, M. E. Terry, Rank analysis of incomplete block designs: I. The method of paired comparisons, Biometrika 39 (1952) 324–345.
[9] C. DeLong, N. Pathak, K. Erickson, E. Perrino, K. Shim, J. Srivastava, TeamSkill: Modeling team chemistry in online multi-player games, Pacific-Asia Conference on Knowledge Discovery and Data Mining (2011) 519–531.
[10] M. Jang, S. Kim, C. Suh, S. Oh, Optimal sample complexity of M-wise data for top-K ranking, Advances in Neural Information Processing Systems (2017) 1686–1696.
[11] Y. Chen, J. Fan, C. Ma, K. Wang, Spectral method and regularized MLE are both optimal for top-K ranking, Annals of Statistics 47 (2019) 2204–2235.
[12] X. Glorot, A. Bordes, Y. Bengio, Deep sparse rectifier neural networks, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (2011) 315–323.
[13] K. Hornik, M. Stinchcombe, H. White, Multilayer feedforward networks are universal approximators, Neural Networks 2 (1989) 359–366.
[14] X. Glorot, Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, International Conference on Artificial Intelligence and Statistics (2010) 249–256.
[15] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).
[16] O. Cooley, M. Kang, C. Koch, Threshold and hitting time for high-order connectedness in random hypergraphs, Electronic Journal of Combinatorics (2016) 2–48.
[17] L. Maystre, M. Grossglauser, Just sort it! A simple and effective approach to active preference learning, International Conference on Machine Learning (2017) 2344–2353.
[18] O. Delalleau, E. Contal, E. Thibodeau-Laufer, R. C. Ferrari, Y. Bengio, F. Zhang, Beyond skill rating: Advanced matchmaking in Ghost Recon Online, IEEE Transactions on Computational Intelligence and AI in Games 4 (2012) 167–177.