Group Match Prediction via Neural Networks

Sunghyun Kim1, Minje Jang2 and Changho Suh3
1 Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA, 02139, USA
2 Krafton, 231 Teheran-ro, Gangnam-gu, Seoul, 06142, South Korea
3 Korea Advanced Institute of Science and Technology, 291 Daehak-ro, Yuseong-gu, Daejeon, 34141, South Korea

sunghyun@mit.edu (S. Kim); jangminje427@krafton.com (M. Jang); chsuh@kaist.ac.kr (C. Suh)

3rd Edition of Knowledge-aware and Conversational Recommender Systems (KaRS) & 5th Edition of Recommendation in Complex Environments (ComplexRec) Joint Workshop @ RecSys 2021, 27 September–1 October 2021, Amsterdam, Netherlands
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
We consider the group match prediction problem where the goal is to estimate the probability of one group of items being preferred over another, based on partially observed group comparison data. Most prior algorithms, each tailored to a specific statistical model, suffer from inconsistent performances across different scenarios. Motivated by a key common structure of the state-of-the-art algorithms, which we call the reward-penalty structure, we develop a unified framework that achieves consistently high performances across a wide range of scenarios. Our distinction lies in introducing neural networks embedded with the reward-penalty structure. Extensive experiments on synthetic and real-world datasets show that our framework consistently leads to the best performance, while the state-of-the-art algorithms perform inconsistently.

Keywords: group match prediction, group recommendation, neural network

1. Introduction

We often compare a pair of items (or groups of items) and judge which one is of higher utility. We also become interested in predicting comparison outcomes for unobserved pairs, based on partial comparison data for previously observed pairs, and in recommending select options in preference to the others. We investigate the group match prediction (group recommendation) problem where we aim to estimate the probability of one group of M items being preferred over another, based on partially observed group comparison data.

Most prior algorithms postulate ground-truth utilities for items, and assume underlying models to exist. These models are considered to describe the utility of a group as a function of the utilities of its items, and to govern statistical patterns of comparison data based on the group utilities [1, 2, 3, 4]. Some algorithms have been developed for prominent yet specific statistical models, and shown to attain optimal performances (see Section 4.1). Yet, major challenges arise when we attempt to employ them in practice.

First, prior algorithms lead to inconsistent performances. Tailored to specific models, they achieve decent performances in some scenarios that are well represented by their models, but perform poorly in others. This inconsistency limits the application of each algorithm to a narrow range of scenarios.

Second, no prior algorithm can offer a satisfactory performance when a given scenario cannot be represented by well-studied models. For instance, underlying models for some scenarios can be inherently too complex to be approximated by existing models. In such cases, all prior algorithms are ill-suited.

We address all these challenges by developing a unified algorithmic framework that can (i) infer and adapt to any underlying models for given scenarios; and (ii) achieve consistently high performances.

Main Contributions. First, we identify a key common structure (which we call the reward-penalty structure, to be detailed in Section 4.1) shared among state-of-the-art algorithms, and incorporate the structure into our framework. This emphasis on structural aspects enables our framework to attain high performances (Section 5). Second, we introduce neural networks so as to enable our framework to infer and adapt to any latent models. We design the neural networks to respect the reward-penalty structure in order to retain high performances across various scenarios (Sections 4.2 and 4.3). Third, we construct the neural networks in such a way that our framework is robust against prohibitive scalability issues. This robustness makes our framework applicable to a broader range of scenarios which involve large numbers of items.

Insight from the state of the art: Looking into state-of-the-art algorithms for the tasks of match prediction and rank aggregation (a related and long-studied task) [5, 6, 1, 2], we find that they share a key element: the so-called reward-penalty mechanism for estimating the utilities of items. The mechanism rewards an item greatly for winning (or being more preferred) in a disadvantageous comparison where its group is weaker than the counterpart. Likewise, it penalizes an item greatly for losing (or being less preferred) in an advantageous one. Also, the magnitudes of such rewards and penalties are proportional to the item's contribution to its group. It turns out that across different state-of-the-art algorithms, only the terms corresponding to rewards and penalties vary (Section 4.1 for details), but the common mechanism remains unchanged. We take it as the main structure in our framework ((5) in Section 4.2 for details).

Neural networks: Motivated by the above observations, we incorporate three separate neural network modules into our framework (Sections 4.2 and 4.3 for architectural details): two modules represent reward and penalty mechanisms, and the other combines them to produce final comparison outcome predictions. These modules are trained on the given dataset so that they can adapt to any hidden models that underlie the dataset, making our framework attain universal applicability. The overall architecture that puts all three modules together is specifically designed to maintain the reward-penalty structure, in order to retain high performances across various scenarios.

Scalability: A prior work employed a single-layer neural network as an initial effort to predict winning probabilities of group matches [7]. This attempt showed a promising result, demonstrating improved prediction accuracy on a real-world online game dataset. However, it exhibits scalability issues. It requires one input node per item, thus is too prohibitive to be extended to scenarios with large numbers of items (scaling with n). In contrast, we design our neural network modules carefully to avoid such scalability issues (Figures 1 and 2: the input dimensions are fixed to 2M regardless of n). Our framework also provides higher performances compared to the prior work.

Extensive experiments: We compare our framework with the single-layer baseline and other algorithms by extensive experiments on synthetic and real-world datasets. Using synthetic datasets (Section 5.1), we show that our framework achieves the best performance across four prominent models in the literature, while the other algorithms suffer from inconsistent performances across different models. Three of the models consider extensions of the Bradley-Terry-Luce model [8] to the group comparison scenario. The other considers a generalized version of the Thurstone model [4] widely used in skill rating systems of online games. Using real-world datasets (Section 5.2), we show that our framework performs consistently well across diverse real-world scenarios. We consider five real-world datasets (sources in Footnote 7). One is a crowd-sourced image classification dataset, another is a collection of movie ratings, and the other three are online game match records. In our performance evaluations, we use the cross entropy loss (see (1)) and the prediction accuracy (see (8)).

Ablation studies: To verify that the reward-penalty structure in our framework plays a key role in bringing performance improvements, we conduct ablation studies (Table 2 in Section 5.2). As our baseline, we consider our framework where the reward and penalty modules are arbitrarily made defunct (Section 5.2 for details). Evaluating the performances of our framework and the baseline on all five real-world datasets in terms of both cross entropy loss and prediction accuracy, we demonstrate that the reward and penalty modules consistently improve performances across all datasets and metrics, confirming their effectiveness.
2. Related Work

The most relevant prior works are [1, 2, 3, 4]. These works assume prominent models (some long-established and some widely used in practice) for in-group interactions and group comparisons, and carry out statistical analysis to a great extent. We compare our framework with the algorithms developed therein.

The problem of estimating individual item utilities from group comparison data has been investigated in [1, 2]. They considered extensions of the BTL model where the group utility is either the sum or the product of its items' utilities, and group comparisons follow the BTL model. We call the two models the BTL-sum and BTL-product models respectively.

A more advanced in-group interaction model has been developed in [3]. They considered a scenario where a pair of individual items in a group leads to a synergy. The group utility is represented as the sum of two quantities, which we call the HOI model: (1) the sum of individual item utilities and (2) the sum of the products of all pairs of individual item utilities. The work of [9] has considered a scenario where any k-tuple of items leads to a synergy.

In [4], they assumed individual item utilities to be centered around a mean following a Gaussian distribution and viewed the group utility as their sum, which we call the Thurstone model. Their algorithm is widely used in skill rating systems of online games where groups of users compete.

The work of [7] has made an initial effort in employing a neural network to predict winning probabilities of group matches. It has been shown that a single-layer neural network can fit some variants of the BTL model [2] and improve prediction accuracy through experiments on a real-world online game dataset. As noted, one clear distinction compared to our work lies in scalability. The neural network developed in the prior work requires one input node per item, thus is prohibitive for scenarios with large numbers of items. In contrast, we develop a neural network that requires only a fixed number of input nodes regardless of the number of items (Section 4.2 for details).
3. Problem Setup

The goal is to predict comparison outcome likelihoods for unobserved pairs of groups, given partial group comparison data. Each data point consists of (A, B, y_AB). A and B are groups of M individual items. y_AB indicates which group is preferred: y_AB = 𝕀(A ≻ B), where A ≻ B indicates A being preferred over B. We denote the set of observed comparisons by D_obs and that of unobserved comparisons by D_unobs. Our available training dataset is {(A, B, y_AB)}_{(A,B)∈D_obs}. We use the cross entropy loss as our metric. Let us define ŷ_AB as the estimate of Pr[y_AB = 1] produced by our algorithmic framework. Then formally, our goal is to minimize:

$$\frac{-1}{|D_{\mathrm{obs}}|} \sum_{(A,B)\in D_{\mathrm{obs}}} y_{AB}\log \hat{y}_{AB} + (1-y_{AB})\log(1-\hat{y}_{AB}). \qquad (1)$$

Notation. [n] = {1, 2, …, n} is the set of all items. We use lowercase letters (e.g., i) for individual items, and uppercase letters (e.g., A) for sets of items. {w_i}_{i∈A} is the set of w_i's for all i ∈ A. [w_i]_{i∈A} is a vector of w_i's for all i ∈ A, and its ordering is provided in the context. ŷ is an estimate of y. We use subscripts as in y_AB when y concerns a comparison between A and B. We use superscripts as in w^(ℓ) when w is updated iteratively. We use bold symbols as in w to indicate the vector of w_i's for all i ∈ [n].
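To make the objective (1) concrete, the following is a minimal Python sketch of its computation; the function name, array layout, and toy numbers are illustrative, not from the paper.

```python
import numpy as np

def cross_entropy_loss(y, y_hat, eps=1e-12):
    """Empirical cross entropy over observed comparisons, as in (1).

    y     : outcomes y_AB in {0, 1}, one entry per observed pair (A, B)
    y_hat : predicted probabilities Pr[y_AB = 1] for the same pairs
    """
    y_hat = np.clip(y_hat, eps, 1.0 - eps)  # guard against log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Toy usage with three observed comparisons:
y = np.array([1, 0, 1])
y_hat = np.array([0.8, 0.3, 0.6])
print(cross_entropy_loss(y, y_hat))  # ~0.364
```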
4. Proposed Algorithm

4.1. Motivation

Our framework design is inspired by optimal state-of-the-art algorithms (achieving minimal sample complexity or global minima of the cross entropy loss) in extensions of the Bradley-Terry-Luce (BTL) model [8]. We make the following key observations regarding the structural similarities shared by the state-of-the-art:

(a) They exhibit "reward" and "penalty" terms in the estimation process (details in (2) and (3)). In updating an item's utility estimate, they reward the item for contributing to its group's winning, and penalize it for contributing to its group's losing.

(b) The expressions of these reward/penalty terms vary as the underlying models change.

(c) The magnitudes of rewards and penalties depend on the power dynamics between groups. A greater reward is given to an item when its group is relatively weaker compared to the opponent group, and likewise a greater penalty is given when its group is relatively stronger.

(d) Their magnitudes depend on the portion of an item's contribution (or blame) within its group. Suppose an item's group wins (or loses) in a comparison. The item is given a greater reward (or penalty) when its share of contribution (or blame) within the group is relatively greater.

Reward-Penalty Structure. Let us examine the algorithms developed in extensions of the BTL model.

• Rank Centrality [5] has been developed under the standard BTL model where individual items are compared in pairs. It achieves the minimal sample complexity for top-K rank aggregation, whose task is to estimate the set of top-K items, in certain regimes [10, 11]. As in Section 3, we define y_ij = 𝕀(i ≻ j), given a pair of items i and j. By some manipulation, the update rule at step ℓ for the individual item utility w_i^(ℓ) of item i can be shown as¹:

$$w_i^{(\ell+1)} \leftarrow w_i^{(\ell)} + \alpha \sum_{j:(i,j)\in D_{\mathrm{obs}}} \left( y_{ij}\, w_j^{(\ell)} - (1-y_{ij})\, w_i^{(\ell)} \right). \qquad (2)$$

For item i, one can interpret w_j^(ℓ) as the reward since it increases w_i^(ℓ+1) when i ≻ j (y_ij = 1), and w_i^(ℓ) (next to (1 − y_ij)) as the penalty since it decreases w_i^(ℓ+1) when i ≺ j (y_ij = 0). α is a step size in the update. Note that the reward is large when the opponent's utility estimate is large; we give a large reward for a win against a strong opponent (observation (c) above). Likewise, the penalty is large when the item's own utility estimate is large; we give a large penalty for a loss against a weak opponent (observation (c) above).

• Majorization-Minimization (MM) for the BTL-sum model has been developed in [6, 1]. We define w_A^(ℓ) := Σ_{i∈A} w_i^(ℓ). By some manipulation, the update rule at step ℓ for the individual item utility w_i^(ℓ) of item i can be shown as²:

$$w_i^{(\ell+1)} \leftarrow w_i^{(\ell)} + \alpha_i^{(\ell)} \sum_{(A,B)\in D_{\mathrm{obs}}:\, i\in A} \left( y_{AB}\cdot R_{AB,i}^{(\ell)} - (1-y_{AB})\cdot P_{AB,i}^{(\ell)} \right) \qquad (3)$$

where

$$R_{AB,i}^{(\ell)} = \frac{w_B^{(\ell)}}{w_A^{(\ell)}+w_B^{(\ell)}}\cdot\frac{w_i^{(\ell)}}{w_A^{(\ell)}}, \qquad P_{AB,i}^{(\ell)} = \frac{w_A^{(\ell)}}{w_A^{(\ell)}+w_B^{(\ell)}}\cdot\frac{w_i^{(\ell)}}{w_A^{(\ell)}}. \qquad (4)$$

Note that the update rule (3) is similar to (2) of Rank Centrality, but the reward and penalty terms in (4) are different (observation (b) above). The interpretation is similar. The reward for item i is large when the opponent group's utility estimate is large (see w_B^(ℓ) in the numerator of R_AB,i^(ℓ)). Note also that the larger the contribution of item i within its own group (see w_i^(ℓ)/w_A^(ℓ) of R_AB,i^(ℓ)), the greater the reward (observation (d) above). The same holds for the penalty.

• MM for the BTL-product model has been developed in [2] and shown to achieve global minima in terms of the cross entropy loss. Its individual item utility update rule is described as in (3) but the reward and penalty terms are different³. A similar interpretation applies, and all observations (a)–(d) can be found.

We introduce two separate modules to represent rewards and penalties respectively. We employ neural networks for the modules so they can adapt to the underlying models for any given scenario and dataset.

Footnote 1: As in [5], α = 1 / max_i d_i, where d_i is the number of distinct items to which item i is compared. Also, we describe Rank Centrality as an iterative algorithm (one way to obtain the stationary distribution of the empirical pairwise preference matrix) to highlight its inherent reward-and-penalty structure.

Footnote 2: α_i^(ℓ) = (Σ_{(A,B)∈D_obs: i∈A} (w_A^(ℓ) + w_B^(ℓ))^{-1})^{-1}.

Footnote 3: α_i^(ℓ) = (Σ_{(A,B)∈D_obs: i∈A} w_A^(ℓ)/(w_A^(ℓ) + w_B^(ℓ)))^{-1}, where w_A^(ℓ) = ∏_{i∈A} w_i^(ℓ), R_AB,i^(ℓ) = w_B^(ℓ)/(w_A^(ℓ) + w_B^(ℓ)) · w_i^(ℓ), and P_AB,i^(ℓ) = w_A^(ℓ)/(w_A^(ℓ) + w_B^(ℓ)) · w_i^(ℓ).
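For concreteness, here is a small Python sketch of the Rank Centrality-style update (2); the data structures are our own illustrative choices. The MM updates (3)-(4) follow the same skeleton with the reward and penalty terms of (4) substituted.

```python
import numpy as np

def reward_penalty_step(w, comparisons, alpha):
    """One pass of the reward-penalty update (2).

    w           : current utility estimates, shape (n,)
    comparisons : list of (i, j, y_ij) with y_ij = 1 iff i was preferred;
                  assumed to contain both orientations of each observed pair
    alpha       : step size, e.g. 1 / max_i d_i as in Footnote 1
    """
    w_next = w.copy()
    for i, j, y_ij in comparisons:
        # reward w[j] for i's win over j; penalty w[i] for i's loss
        w_next[i] += alpha * (y_ij * w[j] - (1 - y_ij) * w[i])
    return w_next
```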
4.2. Modules R and P

Structure: Figure 1 depicts the detailed architecture of the two modules which reflect the rewards (R) and penalties (P) discussed in Section 4.1. The input and output of modules R and P are of dimension 2M. M can be set arbitrarily according to the scenario of interest. R (P respectively) takes as input the current utility estimates of the individual items in a given group comparison (A, B), and produces as output the current reward (penalty respectively) estimates for the items. All layers are fully connected. The activations between layers are rectified linear units [12]. The final activation is the sigmoid function [13].

Figure 1: Architecture of module R (and P, which is omitted as it has the same structure with different weights). It takes as input utility estimates for the 2M items in a pair of groups (A, B) and produces as output reward and penalty estimates for the items. [The figure shows an input layer of dimension 2M, fully connected hidden layers with ReLU activations, and a sigmoid output layer.]

Operation: Recall that our comparison data samples do not include utility estimate information. Each sample only specifies the pair of groups compared (A, B) and the comparison outcome y_AB. Thus, we begin with a randomly initialized utility estimate vector w^(0) ∈ ℝ^n (initialization details in Section 4.4). Starting from w^(0), we obtain w^(L) ∈ ℝ^n by applying modules R and P repeatedly L times to update w^(ℓ) as follows:

$$w_i^{(\ell+1)} \leftarrow w_i^{(\ell)} + \alpha \sum_{(A,B)\in D_{\mathrm{obs}}:\, i\in A} \left( y_{AB}\cdot R_{AB,i}^{(\ell)} - (1-y_{AB})\cdot P_{AB,i}^{(\ell)} \right), \qquad (5)$$

where⁴ α = c / max_i d_i and d_i = |{(A, B) ∈ D_obs : i ∈ A ∪ B}|. At step ℓ, given w^(ℓ), R and P produce positive real values in [0, 1]. They are reward and penalty values respectively, and are used to obtain w^(ℓ+1) as per (5). They are given as follows:

$$R_{AB,i}^{(\ell)} = \mathsf{R}\left(i, [w_j^{(\ell)}]_{j\in A}, [w_k^{(\ell)}]_{k\in B}\right), \qquad P_{AB,i}^{(\ell)} = \mathsf{P}\left(i, [w_j^{(\ell)}]_{j\in A}, [w_k^{(\ell)}]_{k\in B}\right). \qquad (6)$$

At the end of step ℓ, we normalize w^(ℓ+1) to be zero-mean (w_i^(ℓ+1) ← w_i^(ℓ+1) − (1/n) Σ_{j=1}^n w_j^(ℓ+1)) and unity-norm (w_i^(ℓ+1) ← w_i^(ℓ+1) / ‖w^(ℓ+1)‖₂). We repeat until we obtain w^(L). Figure 3 illustrates the process. See the first stage therein.

Scalability: The input and output dimensions (2M) are independent of the total number of items (n). Thus, our framework does not suffer from scalability issues, in contrast to a prior neural network based approach [7] in which the input dimension of the neural network scales in proportion to n. This scalability issue of the prior approach manifests in our real-world datasets with large numbers of items. See our real-world data experiment results for the GIFGIF and IMDb 5000 datasets in Table 1 in Section 5.2.

Refinement: We apply R and P multiple times (L > 1) to obtain a final utility estimate vector w^(L). The best value of L is set via hyper-parameter tuning. This series of applications serves to "refine" the utility estimates.

Data augmentation: R and P take as input a concatenation of two vectors [w_j^(ℓ)]_{j∈A} and [w_k^(ℓ)]_{k∈B}. As they are vectors, not sets, ordering matters. We apply data augmentation to make our framework robust against arbitrary orderings of the items within a group. Specifically, given a sample, we create extra samples which represent the same outcome but have different item orderings. For example, given a sample (A = (1, 2), B = (3, 4), y_AB = 1), we create extra samples such as (A′ = (2, 1), B′ = (4, 3), y_A′B′ = 1). We also seek robustness against arbitrary orderings of the groups. Specifically, we create extra samples by changing the order of the two sets A and B as in (A = (3, 4), B = (1, 2), y_AB = 0) and A′ and B′ as in (A′ = (4, 3), B′ = (2, 1), y_A′B′ = 0). These techniques help train R and P so as to be robust against item orderings within a group and group orderings.

Ablation study: R and P are the most distinctive features of our framework. To evaluate their effectiveness, we conduct ablation studies. See Table 2 in Section 5.2.

Footnote 4: Prior work (see Footnote 1) has motivated the choice of α. The numerator c is determined by hyper-parameter tuning.
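A minimal PyTorch sketch of module R follows (P is an independent copy with its own weights); the class name, layer widths, and the per-sample refinement helper are our own illustrative choices consistent with Figure 1 and (5), not the authors' implementation.

```python
import torch
import torch.nn as nn

class ModuleR(nn.Module):
    """Module R of Figure 1: fully connected layers with ReLU activations
    in between and a sigmoid output, mapping the 2M current utilities of a
    compared pair (A, B) to 2M reward values in [0, 1]."""
    def __init__(self, M, num_hidden=4, width=None):
        super().__init__()
        width = width or 7 * M          # 7M nodes per hidden layer (Section 5.1)
        dims = [2 * M] + [width] * num_hidden
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.ReLU()]
        layers += [nn.Linear(dims[-1], 2 * M), nn.Sigmoid()]
        self.net = nn.Sequential(*layers)

    def forward(self, w_pair):          # (batch, 2M) -> (batch, 2M)
        return self.net(w_pair)

def refine_step(w, samples, R, P, alpha, M):
    """One application of (5), followed by the zero-mean / unity-norm
    normalization. `samples` holds (A, B, y_AB) with A, B LongTensors of
    length M. Only items of A are updated here; the group-order-swapped
    samples produced by the data augmentation above cover the B side."""
    delta = torch.zeros_like(w)
    for A, B, y in samples:
        w_pair = torch.cat([w[A], w[B]]).unsqueeze(0)
        r = R(w_pair).squeeze(0)[:M]    # rewards for the items of A
        p = P(w_pair).squeeze(0)[:M]    # penalties for the items of A
        delta = delta.index_add(0, A, y * r - (1 - y) * p)
    w = w + alpha * delta
    w = w - w.mean()                    # zero-mean
    return w / w.norm()                 # unity-norm
```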
These techniques orderings for the input to G are the same as those for the help train R and P so as to be robust against item or- input to R and P. As mentioned, the item utilities used derings within a group and group orderings. as input to module G are obtained from the operation of Ablation study: R and P are the most distinctive fea- modules R and P. tures of our framework. To evaluate their effectiveness, 1st stage 2nd stage we conduct ablation studies. See Table 2 in Section 5.2. (β„“) (L) [wi ]i∈A [wi ]i∈A 4.3. Module G w(β„“) (β„“) R (β„“) [RAB,i ]i∈A w(β„“+1) w(L) (L) G [wj ]j∈B zero-mean [wj ]j∈B (β„“) P [PAB,j ]j∈B unity-norm (L) [wi ]i∈A Figure 3: Overall architecture of our framework. ReLU ReLU ReLU ReLU sigmoid Figure 3 illustrates the overall architecture where mod- ules R, P and G are put together. The first stage depicts (L) (t) the operation of R and P (initialization of 𝑀(0) omitted). [w j∈B input [wjj ]j∈B fully connected output layer hidden layers layer The second stage depicts the operation of G. Note that the reward-penalty structure is present. This structure is unaffected during the training process where only the Figure 2: Architecture of module G. It takes as input utility parameters inside the modules are updated. estimates for the 2𝑀 items in (𝐴, 𝐡) (obtained by modules R and P) and produces as output a group comparison prediction. 4.4. Training Procedure To perform our main task, we need more than rewards We split 𝐷obs randomly into 𝐷train and 𝐷val . We let 𝐷val and penalties. As seen in Section 4.1, latent models can be a fraction (1%–2%) of 𝐷obs and use it for validation be assumed to exist. In-group interaction models (among purposes. We divide 𝐷train into 𝐡 equal-size batches. We items within a group) lead to group utilities. Group com- 𝑏 denote by 𝐷train the data in batch 𝑏 ∈ [𝐡]. parison models (between a pair of groups) govern statisti- cal patterns of comparison outcomes. The role of module (π‘Ž) Initialization: Initialize 𝑀(0) using a Gaussian distri- G is to fit these models. Modules R and P help quantify bution whose mean is zero and variance is the nor- and provide item utility estimates. Then module G takes malized identity matrix, and the parameters of R, P as input the item utility estimates for a pair of groups, and G using the Xavier initialization [14]. and produces as output the probability of one group pre- 𝑏 (𝑏) Refinement: For all samples (𝐴, 𝐡) ∈ 𝐷train , start ferred over the other. The three modules interact closely to perform the task. The overall architecture that ties (0) (0) with {([𝑀𝑖 ]π‘–βˆˆπ΄ , [𝑀𝑗 ]π‘—βˆˆπ΅ )}(𝐴,𝐡)∈𝐷 𝑏 initialized in train them all will soon be depicted at the end of this section. (π‘Ž), apply R and P modules 𝐿 times, and obtain Structure: Figure 2 depicts the detailed architecture (𝐿) (𝐿) {([𝑀𝑖 ]π‘–βˆˆπ΄ , [𝑀𝑗 ]π‘—βˆˆπ΅ )}(𝐴,𝐡)∈𝐷 𝑏 . of module G. The input and output of module G are of train dimension 2𝑀 and a scalar respectively. As in modules R 𝑏 (𝑐) Prediction: For all samples (𝐴, 𝐡) ∈ 𝐷train , start with and P, we set the value of 𝑀 as per the given scenario of (𝐿) (𝐿) interest. As the dimensions are independent of the total {([𝑀𝑖 ]π‘–βˆˆπ΄ , [𝑀𝑗 ]π‘—βˆˆπ΅ )}(𝐴,𝐡)∈𝐷 𝑏 obtained in (𝑏), ap- train number of items (𝑛), the module does not suffer from ply G module, and obtain {𝑦𝐴𝐡 Μ‚ }(𝐴,𝐡)∈𝐷 𝑏 . train scalability issues. All layers are fully connected. The activations between layers are rectified linear units [12]. 
4.4. Training Procedure

We split D_obs randomly into D_train and D_val. We let D_val be a fraction (1%–2%) of D_obs and use it for validation purposes. We divide D_train into B equal-size batches. We denote by D_train^b the data in batch b ∈ [B].

(a) Initialization: Initialize w^(0) using a Gaussian distribution whose mean is zero and covariance is the normalized identity matrix, and the parameters of R, P and G using the Xavier initialization [14].

(b) Refinement: For all samples (A, B) ∈ D_train^b, start with {([w_i^(0)]_{i∈A}, [w_j^(0)]_{j∈B})}_{(A,B)∈D_train^b} initialized in (a), apply modules R and P L times, and obtain {([w_i^(L)]_{i∈A}, [w_j^(L)]_{j∈B})}_{(A,B)∈D_train^b}.

(c) Prediction: For all samples (A, B) ∈ D_train^b, start with {([w_i^(L)]_{i∈A}, [w_j^(L)]_{j∈B})}_{(A,B)∈D_train^b} obtained in (b), apply module G, and obtain {ŷ_AB}_{(A,B)∈D_train^b}.

(d) Model parameter update: Update the parameters of R, P, and G via the Adam optimizer [15] to minimize the loss. To compute the loss, replace D_obs in (1) by D_train^b. Apply weight decay regularization with a factor of 0.01. Repeat (b)–(d) for all batches b ∈ [B].

The parameters are updated B times, using D_train once, in one epoch. We use 500 epochs. In each epoch, we compute a validation loss using D_val. We apply early stopping and choose the parameters with the lowest validation loss.
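Putting steps (a)-(d) together, a compact training loop might look as follows, reusing ModuleR, ModuleG, refine_step and predict from the sketches above. It is a sketch under our assumptions (e.g., early stopping simplified to keeping the best validation checkpoint), not the authors' implementation.

```python
def forward_pass(R, P, G, samples, w0, M, L, alpha):
    """Steps (b)-(c): refine w^(0) L times with R and P, then predict."""
    w = w0
    for _ in range(L):                               # (b): refinement
        w = refine_step(w, samples, R, P, alpha, M)
    y_hat = torch.stack([predict(w, A, B, G) for A, B, _ in samples])
    y = torch.tensor([float(y) for _, _, y in samples])
    return y_hat, y                                  # (c): predictions

def train(R, P, G, n, train_batches, val_samples, M, L, alpha,
          epochs=500, lr=1e-2):
    w0 = torch.randn(n) / n ** 0.5                   # (a): zero mean, cov I/n
    params = [*R.parameters(), *P.parameters(), *G.parameters()]
    opt = torch.optim.Adam(params, lr=lr, weight_decay=0.01)  # (d)
    bce = torch.nn.BCELoss()                         # the loss (1)
    best_val, best_state = float("inf"), None
    for _ in range(epochs):                          # 500 epochs
        for batch in train_batches:                  # b = 1, ..., B
            y_hat, y = forward_pass(R, P, G, batch, w0, M, L, alpha)
            loss = bce(y_hat, y)                     # (1) on D_train^b
            opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():                        # validation for early stopping
            val = bce(*forward_pass(R, P, G, val_samples, w0, M, L, alpha))
        if val.item() < best_val:                    # keep best checkpoint
            best_val = val.item()
            best_state = [p.detach().clone() for p in params]
    return best_val, best_state
```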
5. Experimental Results

To verify the broad applicability of our framework, we conduct extensive experiments using synthetic and real-world datasets. We compare it with six other algorithms: MM-sum [1], MM-prod [2], SGD-HOI [3], TrueSkill [4], Rank Centrality [5], and an algorithm based on a single-layer neural network [7]. SGD-HOI has been developed for the factorization HOI model in [3] and TrueSkill for the Thurstone model in [4]. As Rank Centrality has been developed for the case of comparing two individual items, we consider its natural extensions depending on the dataset. For example, in the sum model, we replace the summation in (2) by Σ_{(A,B)∈D_obs: i∈A} (y_AB Σ_{k∈B} w_k^(ℓ) − (1 − y_AB) Σ_{k∈A} w_k^(ℓ)). The single-layer neural network algorithm developed in [7] has a scalability issue. It requires at least one input node per item, thus becomes bloated for large n. This drawback prevents us from measuring its performance in our setup⁵ for some real-world datasets such as GIFGIF and IMDb 5000 (to be presented in Table 1 in Section 5.2).

5.1. Synthetic Data Experiments

We use four synthetic datasets: BTL-sum, BTL-product, HOI and a generalized Thurstone. We set n = 300 and M = 5. In the HOI model, we generate the ground-truth utilities and dimension-7 features using Gaussian distributions. In the others, we generate the ground-truth utilities uniformly at random. We generate 5n log n distinct paired groups and each pair is compared 10 times.⁶ We split the generated datasets randomly into D_obs (90%) and D_unobs (10%). All algorithms use D_obs to predict unobserved group comparisons in D_unobs. They use a fraction (1%–2%) of D_obs for validation purposes. We use L = 20 and a learning rate of 10⁻². We use four hidden layers for all modules. For R and P, each layer consists of 7M nodes, and for G, each layer consists of 9M nodes.

Footnote 5: Intel Core i7-6850K @ 3.6GHz (CPU) and GeForce GTX 1080 Ti (single GPU).

Footnote 6: The number 5n log n is chosen in order to guarantee connectedness between any pair of nodes within a comparison (hyper-)graph where nodes represent items and a hyper-edge represents a group comparison [16]. Without connectedness guarantees, some algorithms may fail to produce meaningful estimates.
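As an illustration of this data-generation protocol, the following sketch samples a BTL-sum dataset; the function name and the uniform range for the ground-truth utilities are our assumptions, since the paper does not specify them.

```python
import numpy as np

def make_btl_sum_dataset(n=300, M=5, repeats=10, seed=0):
    """Synthetic BTL-sum data as in Section 5.1: a group's utility is the
    sum of its items' utilities, and Pr[A preferred] = w_A / (w_A + w_B)."""
    rng = np.random.default_rng(seed)
    w = rng.uniform(0.1, 1.0, size=n)        # ground-truth utilities (assumed range)
    data = []
    for _ in range(int(5 * n * np.log(n))):  # 5 n log n distinct paired groups
        items = rng.choice(n, size=2 * M, replace=False)
        A, B = items[:M], items[M:]
        p = w[A].sum() / (w[A].sum() + w[B].sum())
        for _ in range(repeats):             # each pair compared 10 times
            data.append((A, B, int(rng.random() < p)))
    return data
```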
Figure 4: Synthetic data experiments. From left to right, top to bottom: BTL-sum model, BTL-product model, HOI model, and a generalized Thurstone model. [Each panel plots the cross entropy loss (y-axis; measured on D_unobs, the 10% of the dataset held out for testing) against the amount of data used for training, in fractions of D_obs, the 90% of the dataset used for training (x-axis). Off-the-scale curves of low-performance algorithms are not shown.]

Figure 4 shows our results. The performance curves of the algorithms that underperform by large gaps are not presented. Let us discuss each sub-figure from left to right. (1) BTL-sum model: In most settings where we have sufficient data, our framework achieves the performance promised by MM-sum, which has been shown in [1] to achieve local minima in terms of the cross entropy loss. (2) BTL-product model: Our framework achieves the optimal performance promised by MM-prod, which has been shown in [2] to achieve global minima in terms of the cross entropy loss. (3) HOI model: SGD-HOI performs best in most settings. Our framework is second-best with a slight gap to the best performance in those settings, but performs best when we have sufficient data. This is because our framework, using neural networks, is affected by overfitting when data is insufficient. (4) Thurstone model: MM-prod and our framework perform best. TrueSkill comes next with a gap, but it clearly outperforms the others. Interestingly, TrueSkill, developed specifically for the Thurstone model, does not lead to the best performance. But this result does not run counter to theory, as its optimality has not been shown in the literature.

Our framework performs consistently best (or near-best with a marginal gap) across all datasets, while the others perform inconsistently across them. Some perform well in one, but poorly in the others (for example, MM-sum performs best only in the BTL-sum model but poorly in all others). It is important to note that our framework outperforms the single-layer neural network baseline developed in [7] across all datasets.

5.2. Real-World Data Experiments

As in Section 5.1, we split the real-world datasets randomly into D_obs (90%) and D_unobs (10%), and use a fraction (1%–2%) of D_obs for validation purposes if necessary. We use five real-world datasets⁷: GIFGIF, HOTS, DOTA 2, LoL, and IMDb 5000. We use L = 30, 15, 15, 20, 20 and learning rates of 10⁻³, 10⁻³, 10⁻², 10⁻², 10⁻² respectively. We use four hidden layers for all modules. Each layer consists of 7M nodes for R and P, and 9M nodes for G.

Let us discuss each dataset. (1) GIFGIF: A crowd-sourcing project.⁷ We use the dataset pre-processed in [17]. A participant is presented with two images and asked to choose the one which better describes a given emotion.⁸ This dataset belongs to a special case of our interest as individual comparisons are concerned. We consider the emotion of happiness. We have 6,120 images and 106,886 samples. (2) HOTS: A collection of HOTS match records from 10/26/17 to 11/26/17 collected by HOTS Logs.⁷ Each match consists of two groups with five players each. The players choose heroes for each match out of a pool of 84. We choose high-quality matches only, where all players are highly skilled according to some available statistics. There are 26,486 records. (3) DOTA 2: A collection of DOTA 2 match records.⁷ Each match consists of two groups with five players each, and they choose heroes out of a pool of 113. There are 50,000 records. (4) LoL: A collection of LoL professional match records.⁷ Two groups with five players each compete. The players choose heroes out of a pool of 140. There are 7,610 records. (5) IMDb 5000: A collection of meta-data for 5,000 movies.⁷ Each movie has a score and is associated with keywords. To fit our purpose, we generate match records for movie pairs. We consider each movie as a group and its five keywords as its items. Given a pair, we declare a win for the one with the higher score. We have 8,021 keywords and 123,420 samples.

In addition to the cross entropy loss, we also consider another metric highly relevant in practice: prediction accuracy. We declare a win for a group if the estimate of its winning probability is above a certain threshold, which we set as 0.5 [18]. Thus, the prediction accuracy is expressed as follows⁹:

$$\frac{1}{|D_{\mathrm{unobs}}|} \sum_{(A,B)\in D_{\mathrm{unobs}}} y_{AB}\,\mathbb{I}_{\geq \frac{1}{2}}(\hat{y}_{AB}) + (1-y_{AB})\,\mathbb{I}_{< \frac{1}{2}}(\hat{y}_{AB}). \qquad (8)$$

Footnote 7: gifgif.media.mit.edu (GIFGIF); hotslogs.com/Info/API (HOTS); kaggle.com/devinanzelmo/dota-2-matches (DOTA 2); kaggle.com/chuckephron/leagueoflegends (LoL); kaggle.com/carolzhangdc/imdb-5000-movie-dataset (IMDb 5000).

Footnote 8: "Neither" is also allowed, but we exclude such data.

Footnote 9: We define 𝕀_{≥1/2}(x) as 1 if x ≥ 1/2 and as 0 otherwise. Similarly, 𝕀_{<1/2}(x) equals 1 if x < 1/2 and 0 otherwise.
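Expressed in code, the accuracy metric (8) reduces to thresholded agreement between predictions and outcomes; a minimal sketch:

```python
import numpy as np

def prediction_accuracy(y, y_hat, threshold=0.5):
    """Prediction accuracy as in (8): predict a win for A whenever
    y_hat_AB >= 1/2 (threshold 0.5, following [18])."""
    pred = (np.asarray(y_hat) >= threshold).astype(int)
    return float(np.mean(pred == np.asarray(y)))
```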
Table 1: Match prediction results in terms of cross entropy loss (CE-Loss) and prediction accuracy (Acc) in the real-world dataset experiments (N/A: not available due to scalability issues). The numbers in parentheses indicate the ranks among the algorithms in each dataset and metric; (1) marks the best performance and (2) the second-best.

                   GIFGIF                HOTS                  DOTA 2                LoL                   IMDb 5000
                   CE-Loss    Acc        CE-Loss    Acc        CE-Loss    Acc        CE-Loss    Acc        CE-Loss     Acc
Proposed           .3053 (1)  .8729 (3)  .6790 (1)  .5673 (1)  .6599 (1)  .6090 (2)  .6936 (1)  .5411 (1)  .8107 (1)   .6016 (1)
Menke & Martinez   N/A        N/A        .6830 (6)  .5458 (7)  .6601 (2)  .6127 (1)  .6937 (2)  .5236 (5)  N/A         N/A
MM-sum             .3114 (3)  .8758 (1)  .6796 (4)  .5645 (4)  .6610 (4)  .6049 (5)  .7036 (3)  .5166 (7)  1.0719 (4)  .5812 (2)
MM-prod            .3126 (4)  .8758 (1)  .6793 (3)  .5660 (2)  .6603 (3)  .6083 (3)  .7070 (5)  .5249 (3)  3.7455 (5)  .5428 (5)
SGD-HOI            .4056 (6)  .8614 (5)  .6798 (5)  .5562 (5)  .6673 (6)  .5955 (6)  .7125 (6)  .5171 (6)  .9471 (3)   .5780 (3)
TrueSkill          .3063 (2)  .8728 (4)  .6857 (7)  .5499 (6)  .6683 (7)  .5863 (7)  .7057 (4)  .5245 (4)  .8409 (2)   .5736 (4)
Rank Centrality    .3761 (5)  .8555 (6)  .6792 (2)  .5660 (2)  .6667 (5)  .6051 (4)  .7230 (7)  .5276 (2)  3.7455 (5)  .5096 (6)

Table 1 shows our results. Our framework consistently yields the top performances in all cases, while the others suffer from significantly inconsistent performances across datasets and/or metrics.

To verify that our design of modules R and P plays a critical role in achieving consistently high performances across various scenarios, we conduct ablation studies. As our baseline, we consider the case where we refine the initialized estimates w^(0) using R and P only once (L = 1), data augmentation is not applied, and the model parameters of R and P are not updated. Thus, module G takes as input coarse utility estimates and produces as output a group comparison outcome prediction. This baseline can be viewed as a naive neural network based approach without the key reward-penalty mechanisms.

Table 2: Ablation study results that evaluate the effectiveness of modules R & P in the real-world dataset experiments. "Not Trained" indicates R & P made defunct: to produce coarse item utility estimates, they are applied once (L = 1) without refinements, and data augmentation and model updates are not applied. "Trained" indicates our complete framework. Improvements are CE-Loss decreases and accuracy increases.

                     GIFGIF            HOTS              DOTA 2            LoL               IMDb 5000
                     CE-Loss  Acc      CE-Loss  Acc      CE-Loss  Acc      CE-Loss  Acc      CE-Loss  Acc
R & P Not Trained    0.338    0.864    0.693    0.515    0.689    0.535    0.696    0.520    0.743    0.538
R & P Trained        0.296    0.875    0.682    0.560    0.660    0.614    0.690    0.539    0.716    0.576
Improvements         12.291%  1.338%   1.514%   8.736%   4.255%   14.764%  0.880%   3.704%   3.616%   7.058%

Table 2 shows our results. The best performance improvements reach up to 12.291% in cross entropy loss and 14.764% in prediction accuracy. The average performance improvements are 4.511% in cross entropy loss and 7.12% in prediction accuracy. Note that across all datasets and metrics, it is empirically demonstrated that the reward-penalty structure realized by modules R and P brings consistent improvements.

6. Conclusion

We explore the group match prediction problem where the goal is to predict the probability of one group being preferred over the other for an unseen pair of groups, based on partially observed group comparison data. We develop a unified algorithmic framework that employs neural networks and show that it yields consistently best performances compared to other state-of-the-art algorithms on multiple datasets across various domains.

Acknowledgments

This work was supported by the ICT R&D program of the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2020-0-00626).
References

[1] T. K. Huang, C. J. Lin, R. C. Weng, Generalized Bradley-Terry models and multi-class probability estimates, Journal of Machine Learning Research 7 (2006) 85–115.
[2] T. K. Huang, C. J. Lin, R. C. Weng, Ranking individuals by group comparisons, Journal of Machine Learning Research 9 (2008) 2187–2216.
[3] Y. Li, M. Cheng, K. Fujii, F. Hsieh, C.-J. Hsieh, Learning from group comparisons: Exploiting higher order interactions, Advances in Neural Information Processing Systems (2018) 4981–4990.
[4] R. Herbrich, T. Minka, T. Graepel, TrueSkill™: A Bayesian skill rating system, Advances in Neural Information Processing Systems (2007) 569–576.
[5] S. Negahban, S. Oh, D. Shah, Rank Centrality: Ranking from pairwise comparisons, Operations Research 65 (2016) 266–287.
[6] D. R. Hunter, MM algorithms for generalized Bradley-Terry models, Annals of Statistics (2004) 384–406.
[7] J. E. Menke, T. R. Martinez, A Bradley-Terry artificial neural network model for individual ratings in group competitions, Neural Computing and Applications 17 (2008) 175–186.
[8] R. A. Bradley, M. E. Terry, Rank analysis of incomplete block designs: I. The method of paired comparisons, Biometrika 39 (1952) 324–345.
[9] C. DeLong, N. Pathak, K. Erickson, E. Perrino, K. Shim, J. Srivastava, TeamSkill: Modeling team chemistry in online multi-player games, Pacific-Asia Conference on Knowledge Discovery and Data Mining (2011) 519–531.
[10] M. Jang, S. Kim, C. Suh, S. Oh, Optimal sample complexity of M-wise data for top-K ranking, Advances in Neural Information Processing Systems (2017) 1686–1696.
[11] Y. Chen, J. Fan, C. Ma, K. Wang, Spectral method and regularized MLE are both optimal for top-K ranking, Annals of Statistics 47 (2019) 2204–2235.
[12] X. Glorot, A. Bordes, Y. Bengio, Deep sparse rectifier neural networks, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (2011) 315–323.
[13] K. Hornik, M. Stinchcombe, H. White, Multilayer feedforward networks are universal approximators, Neural Networks 2 (1989) 359–366.
[14] X. Glorot, Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, International Conference on Artificial Intelligence and Statistics (2010) 249–256.
[15] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).
[16] O. Cooley, M. Kang, C. Koch, Threshold and hitting time for high-order connectedness in random hypergraphs, Electronic Journal of Combinatorics (2016) 2–48.
[17] L. Maystre, M. Grossglauser, Just sort it! A simple and effective approach to active preference learning, International Conference on Machine Learning (2017) 2344–2353.
[18] O. Delalleau, E. Contal, E. Thibodeau-Laufer, R. C. Ferrari, Y. Bengio, F. Zhang, Beyond skill rating: Advanced matchmaking in Ghost Recon Online, IEEE Transactions on Computational Intelligence and AI in Games 4 (2012) 167–177.