Evaluating Group Recommendation Strategies in
                Memory-Based Collaborative Filtering

                         Nadia A. Najjar                                              David C. Wilson
            University of North Carolina at Charlotte                     University of North Carolina at Charlotte
                   9201 University City Blvd.                                    9201 University City Blvd.
                       Charlotte, NC, USA                                            Charlotte, NC, USA
                      nanajjar@uncc.edu                                              davils@uncc.edu

ABSTRACT                                                               given to the issue of group recommendation [14, 4]. Group
                                                                       recommender systems must manage and balance preferences
Group recommendation presents significant challenges in evo-           from individuals across a group of users with a common pur-
lving best practice approaches to group modeling, but even             pose, in order to tailor choices, options, or information to the
moreso in dataset collection for testing and in developing             group as a whole. Group recommendations can help to sup-
principled evaluation approaches across groups of users. Ear-          port a variety of tasks and activities across domains that
ly research provided more limited, illustrative evaluations for        have a social aspect with shared-consumption needs. Com-
group recommender approaches, but recent work has been                 mon examples arise in social entertainment: finding a movie
exploring more comprehensive evaluative techniques. This               or a television show for family night, date night, or the like
paper describes our approach to evaluate group-based rec-              [22, 11, 23]; finding a restaurant for dinner with work col-
ommenders using data sets from traditional single-user col-            leagues, family, or friends [17]; finding a dish to cook that
laborative filtering systems. The approach focuses on clas-            will satisfy the whole group [5], the book the book club
sic memory-based approaches to collaborative filtering, ad-            should read next, the travel destination for the next family
dressing constraints imposed by sparsity in the user-item              vacation [19, 2, 13], or the songs to play at any social event
matrix. In generating synthetic groups, we model ‘actual’              or at any shared public space [24, 3, 7, 9, 18].
group preferences for evaluation by precise rating agreement              Group recommenders have been distinguished from single
among members. We evaluate representative group aggrega-               user recommenders primarily by their need for an aggre-
tion strategies in this context, providing a novel comparison          gation mechanism to represent the group. A considerable
point for earlier illustrative memory-based results and for            amount of research in group-based recommenders concen-
more recent model-based work, as well as for models of ac-             trates on the techniques used for a recommendation strat-
tual group preference in evaluation.                                   egy, and two main group recommendation strategies have
                                                                       been proposed [14]. The first strategy merges the individual
Categories and Subject Descriptors                                     profiles of the group members into one group representative
H.4 [Information Systems Applications]: Miscellaneous;                 profile, while the second strategy merges the recommenda-
D.2.8 [Software Engineering]: Metrics—complexity mea-                  tion lists or predictions computed for each group member
sures, performance measures                                            into one recommendation list presented to the group. Both
                                                                       strategies utilize recommendation approaches validated for
General Terms                                                          individual users, leaving the aggregation strategy as a dis-
                                                                       tinguishing area of study applicable for group-based recom-
Algorithms, Experimentation, Standardization                           menders.
                                                                          Group recommendation presents significant challenges in
Keywords                                                               evolving best practice approaches to group modeling, but
Evaluation, Group recommendation, Collaborative filtering,             even moreso in dataset collection for testing and in develop-
Memory-based                                                           ing principled evaluation approaches across groups of users.
                                                                       Early research provided more limited, illustrative evalua-
                                                                       tions for group recommender approaches (e.g., [18, 20, 17]),
1.   INTRODUCTION                                                      but recent work has been exploring more comprehensive
  Recommender systems have traditionally focused on the                evaluative techniques (e.g., [4, 8, 1]). Broadly, evaluations
individual user as a target for personalized information fil-          have been conducted either via live user studies or via syn-
tering. As the field has grown, increasing attention is being          thetic dataset analysis. In both types of evaluation, deter-
                                                                       mining an overall group preference to use as ground truth in
                                                                       measuring recommender accuracy presents a complementary
                                                                      aggregation problem to group modeling for generating rec-
                                                                      ommendations. Based on group interaction and group choice
                                                                      outcomes, either a gestalt decision is rendered for the group
                                                                      as a whole, or individual preferences are elicited and com-
                                                                      bined to represent the overall group preference. The former

WOODSTOCK ’97 El Paso, Texas USA                                       lends itself to user studies in which the decision emerges from


                                                                  43
group discussion and interaction, while the latter lends itself        evaluation using the proposed framework. Finally section 5
to synthetic group analysis. Currently, the limited deploy-            outlines our results and discussion.
ment of group recommender systems coupled with the addi-
tional overhead of bringing groups together for user studies
has constrained the availability of data sets that can be used
                                                                       2.     RELATED WORK
to evaluate group based recommenders. Thus as with other                  Previous research that involves evaluation of group recom-
group evaluation e↵orts [4], we adopt the approach of gen-             mendation approaches falls into two primary categories. The
erating synthetic groups for larger scale evaluation.                  first category employs synthetic datasets, generated from ex-
   It is important to note that there are two distinct group           isting single-user datasets (typically MovieLens1 ). The sec-
modeling issues at play. The first is how to model a group             ond category focuses on user studies.
for the purpose of making recommendations (i.e., what a
group’s preference outcome will be). We refer to this as the
                                                                       2.1     Group Aggregation Strategies
recommendation group preference model (RGPM). The sec-                    Various group modeling strategies for making recommen-
ond is how to determine an “actual” group preference based             dations have been proposed and tested to aggregate the in-
on outcomes in user data, in order to represent ground truth           dividual group user’s preferences into a recommendation for
for evaluation purposes (i.e., what a group’s preference out-          the group. Mastho↵ [16] evaluated eleven strategies inspired
come was). We refer to this as the actual group preference             from social choice theory. Three representative strategies are
model (AGPM). For example, it might be considered a triv-              average strategy, least misery, and most happiness.
ial recommendation if each group member had previously
                                                                            • Average Strategy: this is the basic group aggregation
given a movie the same strong rating across the board. How-
                                                                              strategy that assumes equal influence among group
ever, such an agreement point is ideal for evaluating whether
                                                                              members and calculates the average rating of the group
that movie should have been recommended for the group.
                                                                              members for any given item as the predicted rating.
   In evaluating group-based recommenders, the primary con-
                                                                              Let n be the number of users in a group and rij be the
text includes choices made about:
                                                                              rating of user j for item i, then the group rating for
   • the underlying recommendation strategy (e.g., content-                   item i is computed as follows:
     based, collaborative memory-based or model-based)                                                  Pn
                                                                                                           j=1 rji
   • group modeling for making recommednations — RGPM                                            Gri =                            (1)
     (e.g., least misery)                                                                                   n
   • determining actual group preferences for evaluative com-               • Least Misery Strategy: this aggregation strategy is ap-
     parison to system recommendations — AGPM (e.g.,                          plicable in situations where the recommender system
     choice aggregation)                                                      needs to avoid presenting an item that was really dis-
   • choices about metrics for assessment (e.g., ranking,                     liked by any of the group members, i.e., that goal is to
     rating value).                                                           please the least happy member. The predicted rating
                                                                              is calculated as the lowest rating of for any given item
Exploring the group recommendation space involves evalu-                      among group members and computed as follows:
ation across a variety of such contexts.
   To date, we are not aware of a larger-scale group rec-                                           Gri = min rji                    (2)
                                                                                                             j
ommender evaluation using synthetic data sets that (1) fo-
cuses on traditional memory-based collaborative filtering or                • Most Happiness: this aggregation strategy is the oppo-
(2) employs precise overlap across individual user ratings                    site of the least misery strategy. It applies in situations
for evaluating actual group preference. Given the founda-                     where the group is as happy as their happiest member
tional role of classic user-based [22] collaborative filtering                and computed as follows:
in recommender systems, we are interested in understand-
ing the behavior of group recommendation in this context                                            Gri = max rji                    (3)
                                                                                                             j
as a comparative baseline for evaluation. Given that addi-
tional inference to determine “ground truth” preference for            2.2     Evaluation with Synthetic Groups
synthetic groups can potentially decrease precision in eval-              Recent work by Baltrunas [4] used simulated groups to
uation, we are interested in comparing results when group              compare aggregation strategies of ranked lists produced by
members agree precisely in original ratings data.                      a model based collaborative filtering methodology using ma-
   In this paper, we focus on traditional memory-based ap-             trix factorization with gradient descent (SVD). This ap-
proaches to collaborative filtering, addressing constraints            proach addresses sparsity issues for user similarity. The
imposed by sparsity in the user-item matrix. In generating             MovieLens data set was used to simulate groups of di↵er-
valid synthetic groups, we model actual group preferences              ent sizes (2, 3, 4, 8) and di↵erent degrees of similarity (high,
by direct rating agreement among members. Prediction ac-               random). They employed a ranking evaluation metric, mea-
curacy is measured using root mean squared error and mean              suring the e↵ectiveness of the predicted rank list using Nor-
average error. We evaluate the performance of three repre-             malized Discounted Cumulative Gain (nDCG). To account
sentative group aggregation strategies (average, least misery,         for the sparsity in the rating matrix nDCG was computed
most happiness) [15] in this context, providing a novel com-           only over the items that appeared in the target user test set.
parison point for earlier illustrative memory-based results,           The e↵ectiveness of the group recommendation was mea-
for more recent model-based work, and for models of ac-                sured as the average e↵ectiveness (nDCG) of the group mem-
tual group preference in evaluation. This paper is organized           bers where a higher nDCG indicated better performance.
as follows: section 2 overviews related researches. Section 3
                                                                       1
outlines our group testing framework. Section 4 provides the               www.movielens.org


                                                                  44
   Chen et al. [8] also used simulated groups and addressed              tion strategies, including aggregated models and aggregated
the sparsity in user-rating matrix by predicting the missing             predictions. Their aggregated predictions strategy combined
ratings of items belonging in the union set of items rated by            the predictions produced for each of the group members into
group members. They simulated 338 random groups from                     one prediction using a weighted, linear combination of these
the MovieLens data set and used it for evaluating the use                predictions. Evaluation consisted of 170 users where a 108
of Genetic Algorithms to exploit single user ratings as well             of them belonged to a family group with size ranges between
as item ratings given by groups to model group interactions              1 and 4.
and find suitable items that can be considered neighbors in
their implemented neighborhood-based CF.                                 2.4    Establishing Group Preference
   Amer-Yahia et al. [1] also simulated groups from Movie-                  A major question that must be addressed in evaluating
Lens. The simulated groups where used to measure the per-                group recommender systems is how to establish the actual
formance of di↵erent strategies centered around a top-k TA               group preference in order to compare accuracy with system
algorithm. To generate groups a similarity level was spec-               predictions. Previous work by [4, 8, 1] simulated groups
ified, groups were formed from users that had a similarity               from single-user data sets. Their simulated group creation
value within a 0.05 margin. They varied the group similar-               was limited to groups of di↵erent sizes (representing small,
ity between 0.3, 0.5, 0.7 and 0.9 and the size 3, 5 and 8. It            medium and large) with certain degrees of similarity (ran-
was unclear how actual group ratings were established for                dom, homogeneous and heterogeneous ). Chen et al. [8]
the simulated groups or how many groups were created.                    used a baseline aggregation as the ground truth while [4]
                                                                         compares the e↵ectiveness of the group-based recommenda-
2.3    Evaluation with User Studies                                      tion to the e↵ectiveness of the individual recommendations
                                                                         made to each member in the group. This led to our work
   Mastho↵ [15] employed user studies, not to evaluate spe-
                                                                         in investigating ways to create synthesized groups from the
cific techniques, but to determine which group aggregation
                                                                         most commonly used CF single-user data sets taking into
strategies people actually use. Thirty-nine human subjects
                                                                         consideration the ability to identify and establish ground
were given the same individual rating sets from three people
                                                                         truth. We propose a novel Group Testing Framework that
on a collection of video clips. Subjects were asked to decide
                                                                         allows for the creation of synthesized groups that can be
which clips the group should see given time limitations for
                                                                         used for testing in memory-based CF recommenders. In the
viewing only 1, 2, 3, 4, 5, 6, or 7 clips, respectively. In addi-
                                                                         remainder of the paper we give an overview of our proposed
tion, why they made that selection. Results indicated that
                                                                         Group Testing Framework and we report on the evaluations
people particularly use the following strategies: Average,
                                                                         we conducted using this framework.
Average Without Misery and Least Misery.
                                                                            Overall, larger-scale synthetic evaluations for group rec-
   PolyLens [20] evaluated qualitative feedback and changes
                                                                         ommendation have not focused on traditional memory-based
in user behavior for a basic Least Misery aggregation strat-
                                                                         approaches. This may be because it is cumbersome to ad-
egy. Results showed that while users liked and used group
                                                                         dress group generation, given sparsity constraints in the
recommendation, they disliked the minimize misery strat-
                                                                         user-item matrix. Moreover, only limited attention has been
egy They attributed this to the fact that this social value
                                                                         given to evaluation based on predictions, rather than rank-
function is more applicable to groups of smaller sizes.
                                                                         ing. Our evalution approach addresses these issues.
   Amer-Yahia et al. [1] also ran a user study using Ama-
zon’s Mechanical Turk users, they had a total of 45 users
where various groups were formed of sizes 3 and 8 to repre-              3.    GROUP TESTING FRAMEWORK
sent small and large groups. They established an evaluation                 We have developed a group testing framework in order
baseline by generating a recommendation list using four im-              to support evaluation of group recommender approaches.
plemented strategies. The resulting lists are combined into              The framework is used to generate synthetic groups that are
a single group list of distinct items and were presented to              parametrized to test di↵erent group contexts. This enables
the users for evaluation where a relevance score of 1 was                exploration of various parameters, such as group diversity.
given if the user considered the item suitable for the group             The testing framework consists of two main components.
and 0 otherwise. They employed an nDCG measure to eval-                  The first component is a group model that defines specific
uate their proposed prediction lists consensus function. The             group characteristics, such as group coherence. The sec-
nDCG measure was computed for each group member and                      ond component is a group formation mechanism that applies
the average was considered the e↵ectiveness of the group                 the model to identify compatible groups from an underlying
recommendation.                                                          single-user data set, according to outcome parameters such
   Other work considers social relationships and interactions            as the number of groups to generate.
among group members when aggregating the predictions [10,
8, 21]. They model member interactions, social relation-                 3.1    Group Model
ships, domain expertise, and dissimilarity among the group                  In simulating groups of users, a given group will be defined
members when choosing a group decision strategy. For ex-                 based on certain constraints and characteristics, or group
ample, Recio-Garcia et al. [21] described a group recom-                 model. For example, we might want to test recommenda-
mender system that takes into account the personality types              tions based on di↵erent levels of intra-group similarity or
for the group members.                                                   diversity. For a given dataset, the group model defines the
   Berkovsky and Freyne [5] reported better performance in               space of potential groups for evaluation. While beyond the
the recipe recommendation domain when aggregating the                    scope of this paper, we note that the group model for evalu-
user profiles rather than aggregating individual user predic-            ation could include inter-group constraints (diversity across
tions. They implemented a memory-based recommendation                    groups) as well as intra-group constraints (similarity within
approach comparing the performance of four recommenda-                   groups).


                                                                    45
3.1.1    Group Descriptors                                             corresponding group. We note that there are many poten-
   Gartrell et al. [10] use the term “group descriptors” for           tial approaches to model agreement among group members.
specific individual group characteristics (social, expertise,          In this implementation we choose the most straightforward
dissimilarity) to be accounted for within a group model. We            approach, where the average rating among group members
adopt the group descriptor convention to refer to any quan-            is equal to the individual group member rating for that item
tifiable group characteristic that can reflect group structure         as a baseline for evaluation. We do not currently eliminate
and formation. Some of these group descriptors that can                “universally popular” items, but enough test items are iden-
reflect group structure are user-user correlation, number of           tified that we do not expect such items to make a significant
co-rated items between users and demographics such as age              di↵erence. A common practice in evaluation frameworks is
di↵erence. We use these group descriptors to identify rela-            to divide data sets into test and target data sets. In this
tionships between user pairs within a single user data set.            framework the test data set for each group would consist of
                                                                       the identified common item or items for that group.
3.1.2    Group Threshold Matrix
   A significant set of typical group descriptors can be evalu-        4.    EVALUATION
ated on a pairwise basis between group members. For exam-
ple, group coherence can be defined as a minimum degree of             4.1    Baseline Collaborative Filtering
similarity between group members, or a minimum number of
commonly rated items. We employ such pairwise group de-                   We implement the most prevalent memory-based CF algo-
scriptors as a foundational element in generating candidate            rithm, neighborhood-based CF algorithm [12, 22]. The basis
groups for evaluation. We operationalize these descriptors             for this algorithm is to calculate the similarity, wab , which
in a binary matrix data structure, referred to as the Group            reflects the correlation between two users a and b. We mea-
Threshold Matrix (GTM). The GTM is a square n ⇥ n sym-                 sure this correlation by computing the Pearson correlation
metric matrix, where n is the number of users in the system,           defined as:
and the full symmetric matrix is employed for group gener-                             Pn
ation. A single row or column corresponds to a single user,                                  [(rai ra )(rbi rb )]
                                                                                wab = Pn i=1
                                                                                     p                  P                        (4)
and a binary cell value represents whether the full set of                                (r      r a )2 n         r b )2
                                                                                       i=1 ai             i=1 (rbi
pairwise group descriptors holds between the respectively
paired users.                                                             To generate predictions a subset of the nearest neighbors
   To populate the GTM, pairwise group descriptors are eval-           of the active user are chosen based on their correlation.
uated across each user pair in a given single-user dataset.               We then calculate a weighted aggregate of their ratings
The GTM enables efficient storage and operations for test-             to generate predictions for that user. We use the following
ing candidate group composition. A simple lookup indicates             formula to calculate the prediction of item i for user a:
whether two users can group. A bitwise-AND operation on
                                                                                                 Pn
those two user rows indicates which (and how many) other                                           b=1 [(r bi  rb ) · wab ]
users they can group with together. A further bitwise-AND                           pai = ra +         Pn                        (5)
                                                                                                              w
                                                                                                           b=1 ab
with a third user indicates which (and how many) other
users the three can group with together, and so on. Com-                 Herlocker et al. [12] noted that setting a maximum for the
posing such row- (or column-) wise operations provides an              neighborhood size less than 20 negatively a↵ects the accu-
efficient foundation for a generate-and-test approach to cre-          racy of the recommender systems. They recommend setting
ating candidate groups from pairwise group descriptors.                a maximum neighborhood size in the range of 20 to 60. We
                                                                       set the neighborhood size to 50 we also set that as the min-
3.2     Group Formation                                                imum neighborhood size for each member of the groups we
   Once the group model is constructed it can be applied               considered for evaluation. Breese et al. [6] reported that
to generate groups from any common CF user-rating data                 neighbors with higher similarity correlation with the target
models as the underlying data source. The group formation              user can be exceptionally more valuable as predictors than
mechanism applies the set of group descriptors to gener-               those with the lower similarity values. We set this threshold
ate synthetic groups that are valid for the group model. It            to 0.5 and we only consider the ones based on 5 or more
conducts an exhaustive search through the space of poten-              co-rated items.
tial groups, employing heuristic pruning to limit the number
of groups considered. Initially, individual users are filtered         4.2    Group Prediction Aggregation
based on group descriptors that can be applied to single                 Previous group recommender research has focused on sev-
users (e.g., minimum number of items rated). The GTM                   eral group aggregation strategies for combining individual
is generated for remaining users. Baseline pairwise group              predictions. We evaluate the three group aggregation strate-
descriptors are then used to eliminate some individual users           gies which are outlined in section 2.1 as representative RGPMs.
from further consideration (e.g., minimum group size). The             We compare the performance of these three aggregation strate-
GTM is used to generate-and-test candidate groups for a                gies with respect to group characteristics: group size and the
given group size.                                                      degree of similarity within the group.
   To address the issue of modeling actual group preferences
for evaluating system predictions, the framework is tuned to           4.3    Data Set
identify groups where all group members gave at least one                 To evaluate the accuracy of an aggregated predicted rat-
co-rated item the exact same rating among all group mem-               ing for a group we use the MovieLens 100K ratings and 943
bers. Such identified “test items” become candidates for the           users data set. Simulated groups were created based on dif-
testing set in the evaluation process in conjunction with the          ferent thresholds defined for the group descriptors. The two


                                                                  46
        Table 1: Degrees of Group Similarity
               Similarity Definition
               Level      8i, j 2 G
               High       wij 0.5
               Medium     0.5 > wij 0
               Low        0 > wij


  Table 2: Similarity Statistics for Test Data Set
   Degree of Number of Valid Average User-User
   Similarity   Correlations         Similarity
     High          39,650               0.65
    Medium        192,522               0.22
     Low           95,739              -0.25


descriptors we varied were group size and degree of similar-
ity among group members. We presume the same data set
that is used to create the simulated groups is the same data                Figure 1: RMSE - High degree of similarity.
set used to evaluate recommendation techniques.
   By varying the thresholds of the group descriptors used to
create the group threshold matrix we were able to represent            mentation of this framework the group descriptors used to
groups of di↵erent characteristics, which we then used to              define inputs for the group threshold matrix are the user-
find and generate groups for testing. One aspect we wanted             user correlation and the number of co-rated items between
to investigate is the a↵ect of group homogeneity and size on           any user pair. This forms the group model element of the
the di↵erent aggregation methods used to predict a rating              testing framework. For the group formation element we var-
score for a group using the baseline CF algorithms defined in          ied the groups size and for each group the similarity category,
section 4.1. To answer this question we varied the threshold           5000 testable groups were identified (with at least one com-
for the similarity descriptor and then varied the size of the          mon rating across group members). A predicted rating was
group from 2 to 5. We defined three similarity levels: high,           computed for each group member and those values were ag-
medium and low similarity groups as outlined in Table 1                gregated to produce a final group predicted rating. Table 3
where the inner similarity correlation between any two users           gives an overview of the number of di↵erent group combi-
i, j belonging to group G is calculated as defined in equation         nations the framework needs to consider to identify valid,
1.                                                                     and testable groups. The framework exploits the possible
   To ensure significance of the calculated similarity correla-        combinations to identify groups where the group descriptors
tions we only consider user pairs that have at least 5 common          defined are valid between every user pair belonging to that
rated items. For the MovieLens data set used we have a total           group this is then depicted in the GTM.
of 444153 distinct correlations (943 taking two combinations              We then utilized the testing framework to assess the pre-
at a time). For the three similarity levels defined previously         dicted rating computed for a group based on the three de-
the total correlation and average correlation are outlined in          fined aggregation strategies in section 4.2. We compared
Table 2.                                                               the group predicted rating calculated for the test item to
   Table 3 reflects the GTM group generation statistics for            the actual rating using MAE and RMSE across the di↵erent
the underlying data set used in our evaluation. Total com-             aggregation methods.
binations field indicate the number of possible group com-                It is worth noting here that just like any recommendation
binations that can be formed giving user pairs that satisfy            technique quality depends on the quality of the input data,
our group size threshold descriptor. The valid groups field            the quality of the generated test set depends on the quality of
indicates the number of possible groups that satisfy both              the underlying individual ratings data set when it comes to
the size and similarity threshold whereas the testable groups          the ability to generate predictions. For example, prediction
are valid groups with at least one identified test item as de-         accuracy and quality decrease due to sparsity in the original
scribed in section 3.2. As we increase the size of the groups          data set.
to be created the number of combinations the implementa-
tion has to check increases significantly. We can also see             5.   RESULTS AND DISCUSSION
that the number of testable groups is large in comparison                Our evaluation goal is to test group recommendation based
to the number of groups used in actual user studies. As of             on traditional memory-based collaborative filtering techniques,
this writing and due to system restrictions we were able to            in order to provide a basis of comparison that covers (1)
generate all testable groups for group size 2 and 3 across             synthetic group formation for this type of approach, and (2)
all similarity levels, group size 4 for low and high similarity        group evaluation based on prediction rather than ranking.
level and group size 5 for the high similarity level.                  We hypothesize that aggregation results will support previ-
                                                                       ous research for the aggregation strategies tested. In doing
4.4    The Testing Framework                                           so, we investigate the relationship between the group’s co-
  The framework creates a Group Threshold Matrix based                 herence, size and the aggregation strategy used. Figures
on the group descriptor conditions defined. In our imple-              1-6 reflect the MAE and RMSE for these evaluated rela-


                                                                  47
                                       Table 3: Group Threshold Matrix Statistics
                                                             2         3            4                   5
           High Similarity >= 0.5      Total Combinations 39,650   1,351,657   40,435,741         1,087,104,263
                                       Valid Groups        39,650   226,952      417,948             390,854
                                       Testable Groups     37,857   129,826      129,851              71,441
           Medium >=0 < 0.5            Total combinations 192,522 30,379,236 3,942,207,750       434,621,369,457
                                       Valid groups       192,522 17,097,527
                                       Testable groups    187,436 11,482,472
           Low similarity < 0.0        Total combinations  95,739  7,074,964  421,651,608        21,486,449,569
                                       Valid groups        95,739  1,641,946    6,184,151
                                       Testable groups     87,642   470,257      283,676


     Figure 2: MAE - High degree of similarity.

                                                                        Figure 3: RMSE - Medium degree of similarity.
tionships. Examining the graphs for the groups with high
similarity levels, Figures 1 and 2 show that average strat-
egy and most happiness perform better than least misery.
We conducted a t-test to evaluate the results significance
and found that both MAE and RMSE for average and most
happiness strategies, across all group sizes, significantly out-
perform the least misery strategy (p<0.001 ). For group sizes
2 and 3 there was no significant di↵erence between the av-
erage and most happiness strategies (p>0.01 ). For group
sizes 4 and 5 most happiness strategy performs better than
the average strategy (p<0.001 ). Both least happiness and
average strategies performance decreases as the group size
grows. This indicates that a larger group of highly similar
people are as happy as their happiest member.
   Figures 3 and 4 show the RMSE and MAE for groups with
medium similarity levels. The average strategy performs
significantly better than most happiness and least misery
across group sizes 2,3 and 4 (p<0.001 ). For the groups
of size 5 there was no significant di↵erence between aver-
age and most happiness strategies (p>0.01 ). For groups
with medium similarity level the least misery strategy per-
formance is similar to the groups with high coherency levels.
   Figures 5 and 6 show the results for the groups with
low similarity level. Examining the RMSE and MAE in
these graphs the average strategy performs best across all              Figure 4: MAE - Medium degree of similarity.
group sizes compared to the other two strategies. MAE and
RMSE for the average strategy for all group sizes with low


                                                                   48
     Figure 5: RMSE - Low degree of similarity.                           Figure 6: MAE - Low degree of similarity.


coherency had a statistically significant p value (p<0.001 )         testing in this domain. Our work provides novel cover-
compared to both least misery and most happiness strate-             age in the group recommender evaluation space, considering
gies. Inconsistent with the groups with high coherency, for          (1) focus on traditional memory-based collaborative filter-
groups with low coherency the most happiness performance             ing, and (2) employs precise overlap across individual user
starts to decrease as the group size increases while the per-        ratings for evaluating actual group preference. We evalu-
formance of the least misery strategy starts to increase.            ated our framework with a foundational Collaborative Fil-
   These evaluation results indicate that in situations where        tering neighborhood-based approach, prediction accuracy,
groups are formed with highly similar members most happi-            and three representative group prediction aggregation strate-
ness aggregation strategy would be best to model the RGPM            gies. Our results show that for small-sized groups with high-
while for groups with medium to low coherency average                similarity among their members average and most happiness
strategy would be best. These results using the 5000 syn-            perform the best. For larger size groups with high-similarity
thesized groups for each category coincide with the results          performs most happiness performs better. For the low and
reported by Gartrell using real subjects. Gartrell defined           medium similarity groups, average strategy has the best per-
groups based on the social relationships between the group           formance. Overall, this work has helped to extend the cov-
members. They identified three levels of social relationships        erage of group recommender evaluation analysis, and we ex-
(couple, acquaintance and first-acquaintance) that might ex-         pect this will provide a novel point of comparison for further
ist between group members. In their study to compare the             developments in this area. Going forward we plan to evalu-
performance of the three aggregation strategies across these         ate various parameterizations of our testing framework such
social ties, they reported that for the groups of two members        as more flexible AGPM metrics (e.g. normalizing the ratings
with a social tie defined as couple the most happiness strat-        of the individual users).
egy outperforms the other two. For the acquaintance groups,
these groups had 3 members, the average strategy performs            7.   REFERENCES
best, while for the first-acquaintance, they had one group            [1] S. Amer-yahia, S. B. Roy, A. Chawla, G. Das, and
with 12 members, the least misery strategy outperforms the                C. Yu. Group recommendation: Semantics and
best. It is apparent that their results for the couple groups             efficiency. Proceedings of The Vldb Endowment,
performance is equivalent to our high-coherency groups, the               2:754–765, 2009.
acquaintance groups maps to the medium-coherency groups
                                                                      [2] L. Ardissono, A. Goy, G. Petrone, M. Segnan, and
while the first-acquaintance groups follow the low-coherency
                                                                          P. Torasso. Intrigue: Personalized recommendation of
groups. Mastho↵ studies reported that people usually used
                                                                          tourist attractions for desktop and hand held devices.
average strategy and least misery since they valued fairness
                                                                          Applied Artificial Intelligence, pages 687–714, 2003.
and preventing misery. It is worth noting that her studies
                                                                      [3] C. Baccigalupo and E. Plaza. Poolcasting: A social
evaluated these strategies for groups of size 3 only without
                                                                          web radio architecture for group customisation. In
any reference to coherency levels.
                                                                          Proceedings of the Third International Conference on
                                                                          Automated Production of Cross Media Content for
6.   CONCLUSION                                                           Multi-Channel Distribution, pages 115–122,
  As group-based recommender systems become more preva-                   Washington, DC, USA, 2007. IEEE Computer Society.
lent, there is an increasing need for evaluation approaches           [4] L. Baltrunas, T. Makcinskas, and F. Ricci. Group
and data sets to enable more extensive analysis of such sys-              recommendations with rank aggregation and
tems. In this paper we developed a group testing framework                collaborative filtering. In Proceedings of the fourth
that can help address the problem by automating group                     ACM conference on Recommender systems, RecSys
formation resulting in generation of groups applicable for                ’10, pages 119–126, New York, NY, USA, 2010. ACM.


                                                                49
 [5] S. Berkovsky and J. Freyne. Group-based recipe                     B. Smyth, and P. Nixon. Cats: A synchronous
     recommendations: analysis of data aggregation                      approach to collaborative group recommendation.
     strategies. In Proceedings of the fourth ACM                       pages 86–91, Melbourne Beach, Florida, USA,
     conference on Recommender systems, RecSys ’10,                     11/05/2006 2006. AAAI Press, AAAI Press.
     pages 111–118, New York, NY, USA, 2010. ACM.                  [20] M. O’Connor, D. Cosley, J. A. Konstan, and J. Riedl.
 [6] J. S. Breese, D. Heckerman, and C. M. Kadie.                       Polylens: a recommender system for groups of users.
     Empirical analysis of predictive algorithms for                    In Proceedings of the seventh conference on European
     collaborative filtering. In G. F. Cooper and S. Moral,             Conference on Computer Supported Cooperative Work,
     editors, Proceedings of the 14th Conference on                     pages 199–218, Norwell, MA, USA, 2001. Kluwer
     Uncertainty in Artificial Intelligence, pages 43–52,               Academic Publishers.
     1998.                                                         [21] J. A. Recio-Garcia, G. Jimenez-Diaz, A. A.
 [7] D. L. Chao, J. Balthrop, and S. Forrest. Adaptive                  Sanchez-Ruiz, and B. Diaz-Agudo. Personality aware
     radio: achieving consensus using negative preferences.             recommendations to groups. In Proceedings of the third
     In Proceedings of the 2005 international ACM                       ACM conference on Recommender systems, RecSys
     SIGGROUP conference on Supporting group work,                      ’09, pages 325–328, New York, NY, USA, 2009. ACM.
     GROUP ’05, pages 120–123, New York, NY, USA,                  [22] P. Resnick, N. Iacovou, M. Sushak, P. Bergstrom, and
     2005. ACM.                                                         J. Riedl. Grouplens: An open architecture for
 [8] Y.-L. Chen, L.-C. Cheng, and C.-N. Chuang. A group                 collaborative filtering of netnews. In 1994 ACM
     recommendation system with consideration of                        Conference on Computer Supported Collaborative
     interactions among group members. Expert Syst.                     Work Conference, pages 175–186, Chapel Hill, NC,
     Appl., 34:2082–2090, April 2008.                                   10/1994 1994. Association of Computing Machinery,
 [9] A. Crossen, J. Budzik, and K. J. Hammond. Flytrap:                 Association of Computing Machinery.
     intelligent group music recommendation. In                    [23] C. Senot, D. Kostadinov, M. Bouzid, J. Picault,
     Proceedings of the 7th international conference on                 A. Aghasaryan, and C. Bernier. Analysis of strategies
     Intelligent user interfaces, IUI ’02, pages 184–185,               for building group profiles. In P. De Bra, A. Kobsa,
     New York, NY, USA, 2002. ACM.                                      and D. Chin, editors, User Modeling, Adaptation, and
[10] M. Gartrell, X. Xing, Q. Lv, A. Beach, R. Han,                     Personalization, volume 6075 of Lecture Notes in
     S. Mishra, and K. Seada. Enhancing group                           Computer Science, pages 40–51. Springer Berlin /
     recommendation by incorporating social relationship                Heidelberg, 2010.
     interactions. In Proceedings of the 16th ACM                  [24] D. Sprague, F. Wu, and M. Tory. Music selection
     international conference on Supporting group work,                 using the partyvote democratic jukebox. In
     GROUP ’10, pages 97–106, New York, NY, USA,                        Proceedings of the working conference on Advanced
     2010. ACM.                                                         visual interfaces, AVI ’08, pages 433–436, New York,
[11] D. Goren-Bar and O. Glinansky. Fit-recommend ing                   NY, USA, 2008. ACM.
     tv programs to family members. Computers &
     Graphics, 28(2):149 – 156, 2004.
[12] J. Herlocker, J. A. Konstan, and J. Riedl. An
     empirical analysis of design choices in
     neighborhood-based collaborative filtering algorithms.
     Inf. Retr., 5:287–310, October 2002.
[13] A. Jameson. More than the sum of its members:
     challenges for group recommender systems. In
     Proceedings of the working conference on Advanced
     visual interfaces, AVI ’04, pages 48–54, New York,
     NY, USA, 2004. ACM.
[14] A. Jameson and B. Smyth. The adaptive web. chapter
     Recommendation to groups, pages 596–627.
     Springer-Verlag, Berlin, Heidelberg, 2007.
[15] J. Mastho↵. Group modeling: Selecting a sequence of
     television items to suit a group of viewers. User
     Modeling and User-Adapted Interaction, 14:37–85,
     February 2004.
[16] J. Mastho↵. Group recommender systems: Combining
     individual models. In F. Ricci, L. Rokach, B. Shapira,
     and P. B. Kantor, editors, Recommender Systems
     Handbook, pages 677–702. Springer US, 2011.
[17] J. F. McCarthy. Pocket restaurantfinder: A situated
     recommender system for groups. pages 1–10, 2002.
[18] J. F. McCarthy and T. D. Anagnost. Musicfx: an
     arbiter of group preferences for computer supported
     collaborative workouts. In CSCW, page 348, 2000.
[19] K. McCarthy, M. SalamÃş, L. Coyle, L. McGinty,


                                                              50