<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluating Group Recommendation Strategies in Memory-Based Collaborative Filtering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nadia A. Najjar</string-name>
          <email>nanajjar@uncc.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David C. Wilson</string-name>
          <email>davils@uncc.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of North Carolina at Charlotte, 9201 University City Blvd.</institution>
          ,
          <addr-line>Charlotte, NC</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <fpage>43</fpage>
      <lpage>50</lpage>
      <abstract>
        <p>Group recommendation presents significant challenges in evolving best practice approaches to group modeling, but even more so in dataset collection for testing and in developing principled evaluation approaches across groups of users. Early research provided more limited, illustrative evaluations for group recommender approaches, but recent work has been exploring more comprehensive evaluative techniques. This paper describes our approach to evaluating group-based recommenders using data sets from traditional single-user collaborative filtering systems. The approach focuses on classic memory-based approaches to collaborative filtering, addressing constraints imposed by sparsity in the user-item matrix. In generating synthetic groups, we model 'actual' group preferences for evaluation by precise rating agreement among members. We evaluate representative group aggregation strategies in this context, providing a novel comparison point for earlier illustrative memory-based results and for more recent model-based work, as well as for models of actual group preference in evaluation.</p>
      </abstract>
      <kwd-group>
        <kwd>Evaluation</kwd>
        <kwd>Group recommendation</kwd>
        <kwd>Collaborative filtering</kwd>
        <kwd>Memory-based</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Categories and Subject Descriptors</title>
      <p>H.4 [Information Systems Applications]: Miscellaneous; D.2.8 [Software Engineering]: Metrics-complexity measures, performance measures</p>
    </sec>
    <sec id="sec-2">
      <title>1. INTRODUCTION</title>
      <p>
        Recommender systems have traditionally focused on the
individual user as a target for personalized information
filtering. As the field has grown, increasing attention is being
given to the issue of group recommendation [
        <xref ref-type="bibr" rid="ref14 ref4">14, 4</xref>
        ]. Group
recommender systems must manage and balance preferences
from individuals across a group of users with a common
purpose, in order to tailor choices, options, or information to the
group as a whole. Group recommendations can help to
support a variety of tasks and activities across domains that
have a social aspect with shared-consumption needs.
Common examples arise in social entertainment: finding a movie
or a television show for family night, date night, or the like
[
        <xref ref-type="bibr" rid="ref11 ref22 ref23">22, 11, 23</xref>
        ]; finding a restaurant for dinner with work
colleagues, family, or friends [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]; finding a dish to cook that
will satisfy the whole group [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], the book the book club
should read next, the travel destination for the next family
vacation [
        <xref ref-type="bibr" rid="ref13 ref19 ref2">19, 2, 13</xref>
        ], or the songs to play at any social event
or at any shared public space [
        <xref ref-type="bibr" rid="ref18 ref24 ref3 ref7 ref9">24, 3, 7, 9, 18</xref>
        ].
      </p>
      <p>
        Group recommenders have been distinguished from single
user recommenders primarily by their need for an
aggregation mechanism to represent the group. A considerable
amount of research in group-based recommenders
concentrates on the techniques used for a recommendation
strategy, and two main group recommendation strategies have
been proposed [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. The first strategy merges the individual
profiles of the group members into one group representative
profile, while the second strategy merges the
recommendation lists or predictions computed for each group member
into one recommendation list presented to the group. Both
strategies utilize recommendation approaches validated for
individual users, leaving the aggregation strategy as a
distinguishing area of study applicable for group-based
recommenders.
      </p>
      <p>
        Group recommendation presents significant challenges in
evolving best practice approaches to group modeling, but
even more so in dataset collection for testing and in
developing principled evaluation approaches across groups of users.
Early research provided more limited, illustrative
evaluations for group recommender approaches (e.g., [
        <xref ref-type="bibr" rid="ref17 ref18 ref20">18, 20, 17</xref>
        ]),
but recent work has been exploring more comprehensive
evaluative techniques (e.g., [
        <xref ref-type="bibr" rid="ref1 ref4 ref8">4, 8, 1</xref>
        ]). Broadly, evaluations
have been conducted either via live user studies or via
synthetic dataset analysis. In both types of evaluation,
determining an overall group preference to use as ground truth in
measuring recommender accuracy presents a complementary
aggregation problem to group modeling for generating
recommendations. Based on group interaction and group choice
outcomes, either a gestalt decision is rendered for the group
as a whole, or individual preferences are elicited and
combined to represent the overall group preference. The former
lends itself to user studies in which the decision emerges from
group discussion and interaction, while the latter lends itself
to synthetic group analysis. Currently, the limited
deployment of group recommender systems coupled with the
additional overhead of bringing groups together for user studies
has constrained the availability of data sets that can be used
to evaluate group-based recommenders. Thus, as with other
group evaluation efforts [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], we adopt the approach of
generating synthetic groups for larger scale evaluation.
      </p>
      <p>It is important to note that there are two distinct group
modeling issues at play. The first is how to model a group
for the purpose of making recommendations (i.e., what a
group’s preference outcome will be). We refer to this as the
recommendation group preference model (RGPM). The
second is how to determine an “actual” group preference based
on outcomes in user data, in order to represent ground truth
for evaluation purposes (i.e., what a group’s preference
outcome was). We refer to this as the actual group preference
model (AGPM). For example, it might be considered a
trivial recommendation if each group member had previously
given a movie the same strong rating across the board.
However, such an agreement point is ideal for evaluating whether
that movie should have been recommended for the group.</p>
      <p>In evaluating group-based recommenders, the primary
context includes choices made about:</p>
      <list list-type="bullet">
        <list-item><p>the underlying recommendation strategy (e.g., content-based, collaborative memory-based, or model-based);</p></list-item>
        <list-item><p>group modeling for making recommendations, i.e., the RGPM (e.g., least misery);</p></list-item>
        <list-item><p>determining actual group preferences for evaluative comparison to system recommendations, i.e., the AGPM (e.g., choice aggregation);</p></list-item>
        <list-item><p>metrics for assessment (e.g., ranking, rating value).</p></list-item>
      </list>
      <p>Exploring the group recommendation space involves
evaluation across a variety of such contexts.</p>
      <p>
        To date, we are not aware of a larger-scale group
recommender evaluation using synthetic data sets that (1)
focuses on traditional memory-based collaborative filtering or
(2) employs precise overlap across individual user ratings
for evaluating actual group preference. Given the
foundational role of classic user-based [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] collaborative filtering
in recommender systems, we are interested in
understanding the behavior of group recommendation in this context
as a comparative baseline for evaluation. Given that
additional inference to determine “ground truth” preference for
synthetic groups can potentially decrease precision in
evaluation, we are interested in comparing results when group
members agree precisely in original ratings data.
      </p>
      <p>
        In this paper, we focus on traditional memory-based
approaches to collaborative filtering, addressing constraints
imposed by sparsity in the user-item matrix. In generating
valid synthetic groups, we model actual group preferences
by direct rating agreement among members. Prediction
accuracy is measured using root mean squared error (RMSE) and mean
absolute error (MAE). We evaluate the performance of three
representative group aggregation strategies (average, least misery,
most happiness) [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] in this context, providing a novel
comparison point for earlier illustrative memory-based results,
for more recent model-based work, and for models of
actual group preference in evaluation. This paper is organized
as follows: Section 2 reviews related research. Section 3
outlines our group testing framework. Section 4 describes the
evaluation conducted using the proposed framework. Finally, Section 5
presents our results and discussion.
      </p>
      <sec id="sec-2-1">
        <title>2. RELATED WORK</title>
        <p>Previous research that involves evaluation of group
recommendation approaches falls into two primary categories. The
first category employs synthetic datasets, generated from
existing single-user datasets (typically MovieLens<sup>1</sup>). The
second category focuses on user studies.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.1 Group Aggregation Strategies</title>
        <p>
          Various group modeling strategies for making
recommendations have been proposed and tested to aggregate the
preferences of individual group members into a recommendation for
the group. Masthoff [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] evaluated eleven strategies inspired
by social choice theory. Three representative strategies are
the average strategy, least misery, and most happiness.
        </p>
        <list list-type="bullet">
          <list-item>
            <p>Average Strategy: this is the basic group aggregation
strategy that assumes equal influence among group
members and calculates the average rating of the group
members for any given item as the predicted rating.
Let n be the number of users in a group and r<sub>ji</sub> be the
rating of user j for item i; then the group rating Gr<sub>i</sub> for
item i is computed as follows:</p>
            <disp-formula><label>(1)</label><tex-math>Gr_i = \frac{\sum_{j=1}^{n} r_{ji}}{n}</tex-math></disp-formula>
          </list-item>
          <list-item>
            <p>Least Misery Strategy: this aggregation strategy is
applicable in situations where the recommender system
needs to avoid presenting an item that was really
disliked by any of the group members, i.e., the goal is to
please the least happy member. The predicted rating
is the lowest rating for any given item
among group members and is computed as follows:</p>
            <disp-formula><label>(2)</label><tex-math>Gr_i = \min_{j} r_{ji}</tex-math></disp-formula>
          </list-item>
          <list-item>
            <p>Most Happiness: this aggregation strategy is the
opposite of the least misery strategy. It applies in situations
where the group is as happy as its happiest member
and is computed as follows:</p>
            <disp-formula><label>(3)</label><tex-math>Gr_i = \max_{j} r_{ji}</tex-math></disp-formula>
          </list-item>
        </list>
      </sec>
      <sec id="sec-2-3">
        <title>2.2 Evaluation with Synthetic Groups</title>
        <p>
          Recent work by Baltrunas [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] used simulated groups to
compare aggregation strategies of ranked lists produced by
a model-based collaborative filtering methodology using
matrix factorization with gradient descent (SVD). This
approach addresses sparsity issues for user similarity. The
MovieLens data set was used to simulate groups of different
sizes (2, 3, 4, 8) and different degrees of similarity (high,
random). They employed a ranking evaluation metric,
measuring the effectiveness of the predicted rank list using
Normalized Discounted Cumulative Gain (nDCG). To account
for the sparsity in the rating matrix, nDCG was computed
only over the items that appeared in the target user's test set.
The effectiveness of the group recommendation was
measured as the average effectiveness (nDCG) of the group
members, where a higher nDCG indicated better performance.
          <fn><label>1</label><p>www.movielens.org</p></fn>
        </p>
        <p>
          Chen et al. [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] also used simulated groups and addressed
the sparsity in user-rating matrix by predicting the missing
ratings of items belonging in the union set of items rated by
group members. They simulated 338 random groups from
the MovieLens data set and used it for evaluating the use
of Genetic Algorithms to exploit single user ratings as well
as item ratings given by groups to model group interactions
and find suitable items that can be considered neighbors in
their implemented neighborhood-based CF.
        </p>
        <p>
          Amer-Yahia et al. [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] also simulated groups from
MovieLens. The simulated groups were used to measure the
performance of different strategies centered around a top-k TA
algorithm. To generate groups, a similarity level was
specified, and groups were formed from users that had a similarity
value within a 0.05 margin. They varied the group
similarity among 0.3, 0.5, 0.7, and 0.9 and the group size among 3, 5, and 8. It
was unclear how actual group ratings were established for
the simulated groups or how many groups were created.
        </p>
        <sec id="sec-2-3-1">
          <title>2.3 Evaluation with User Studies</title>
          <p>
            Masthoff [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ] employed user studies, not to evaluate
specific techniques, but to determine which group aggregation
strategies people actually use. Thirty-nine human subjects
were given the same individual rating sets from three people
on a collection of video clips. Subjects were asked to decide
which clips the group should see given time limitations for
viewing only 1, 2, 3, 4, 5, 6, or 7 clips, respectively, and
to explain why they made that selection. Results indicated that
people particularly use the following strategies: Average,
Average Without Misery and Least Misery.
          </p>
          <p>
            PolyLens [
            <xref ref-type="bibr" rid="ref20">20</xref>
            ] evaluated qualitative feedback and changes
in user behavior for a basic Least Misery aggregation
strategy. Results showed that while users liked and used group
recommendation, they disliked the minimize misery
strategy. They attributed this to the fact that this social value
function is more applicable to groups of smaller sizes.
          </p>
          <p>
            Amer-Yahia et al. [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ] also ran a user study using
Amazon’s Mechanical Turk. A total of 45 users
were formed into groups of sizes 3 and 8 to
represent small and large groups. They established an evaluation
baseline by generating a recommendation list using each of four
implemented strategies. The resulting lists were combined into
a single group list of distinct items and presented to
the users for evaluation, where a relevance score of 1 was
given if the user considered the item suitable for the group
and 0 otherwise. They employed an nDCG measure to
evaluate their proposed prediction list consensus function. The
nDCG measure was computed for each group member, and
the average was considered the effectiveness of the group
recommendation.
          </p>
          <p>
            Other work considers social relationships and interactions
among group members when aggregating the predictions [
            <xref ref-type="bibr" rid="ref10 ref21 ref8">10,
8, 21</xref>
            ]. They model member interactions, social
relationships, domain expertise, and dissimilarity among the group
members when choosing a group decision strategy. For
example, Recio-Garcia et al. [
            <xref ref-type="bibr" rid="ref21">21</xref>
            ] described a group
recommender system that takes into account the personality types
for the group members.
          </p>
          <p>
            Berkovsky and Freyne [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ] reported better performance in
the recipe recommendation domain when aggregating the
user profiles rather than aggregating individual user
predictions. They implemented a memory-based recommendation
approach comparing the performance of four
recommendation strategies, including aggregated models and aggregated
predictions. Their aggregated predictions strategy combined
the predictions produced for each of the group members into
one prediction using a weighted, linear combination of these
predictions. Evaluation involved 170 users, 108
of whom belonged to a family group, with group sizes ranging between
1 and 4.
          </p>
        </sec>
        <sec id="sec-2-3-2">
          <title>2.4 Establishing Group Preference</title>
          <p>
            A major question that must be addressed in evaluating
group recommender systems is how to establish the actual
group preference in order to compare accuracy with system
predictions. Previous work by [
            <xref ref-type="bibr" rid="ref1 ref4 ref8">4, 8, 1</xref>
            ] simulated groups
from single-user data sets. Their simulated group creation
was limited to groups of different sizes (representing small,
medium and large) with certain degrees of similarity
(random, homogeneous and heterogeneous). Chen et al. [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ]
used a baseline aggregation as the ground truth while [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ]
compares the effectiveness of the group-based
recommendation to the effectiveness of the individual recommendations
made to each member in the group. This led to our work
in investigating ways to create synthesized groups from the
most commonly used CF single-user data sets taking into
consideration the ability to identify and establish ground
truth. We propose a novel Group Testing Framework that
allows for the creation of synthesized groups that can be
used for testing in memory-based CF recommenders. In the
remainder of the paper we give an overview of our proposed
Group Testing Framework and we report on the evaluations
we conducted using this framework.
          </p>
          <p>Overall, larger-scale synthetic evaluations for group
recommendation have not focused on traditional memory-based
approaches. This may be because it is cumbersome to
address group generation, given sparsity constraints in the
user-item matrix. Moreover, only limited attention has been
given to evaluation based on predictions, rather than
ranking. Our evaluation approach addresses these issues.</p>
        </sec>
        <sec id="sec-2-3-3">
          <title>3. GROUP TESTING FRAMEWORK</title>
          <p>We have developed a group testing framework in order
to support evaluation of group recommender approaches.
The framework is used to generate synthetic groups that are
parameterized to test different group contexts. This enables
exploration of various parameters, such as group diversity.
The testing framework consists of two main components.
The first component is a group model that defines specific
group characteristics, such as group coherence. The
second component is a group formation mechanism that applies
the model to identify compatible groups from an underlying
single-user data set, according to outcome parameters such
as the number of groups to generate.</p>
        </sec>
        <sec id="sec-2-3-4">
          <title>3.1 Group Model</title>
          <p>In simulating groups of users, a given group will be defined
based on certain constraints and characteristics, or group
model. For example, we might want to test
recommendations based on different levels of intra-group similarity or
diversity. For a given dataset, the group model defines the
space of potential groups for evaluation. While beyond the
scope of this paper, we note that the group model for
evaluation could include inter-group constraints (diversity across
groups) as well as intra-group constraints (similarity within
groups).</p>
          <p>3.1.1 Group Descriptors</p>
          <p>
            Gartrell et al. [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ] use the term “group descriptors” for
specific individual group characteristics (social, expertise,
dissimilarity) to be accounted for within a group model. We
adopt the group descriptor convention to refer to any
quantifiable group characteristic that can reflect group structure
and formation. Examples of such group descriptors include
user-user correlation, the number of
co-rated items between users, and demographics such as age
difference. We use these group descriptors to identify
relationships between user pairs within a single-user data set.
          </p>
          <p>3.1.2 Group Threshold Matrix</p>
          <p>A significant set of typical group descriptors can be
evaluated on a pairwise basis between group members. For
example, group coherence can be defined as a minimum degree of
similarity between group members, or a minimum number of
commonly rated items. We employ such pairwise group
descriptors as a foundational element in generating candidate
groups for evaluation. We operationalize these descriptors
in a binary matrix data structure, referred to as the Group
Threshold Matrix (GTM). The GTM is a square n × n
symmetric matrix, where n is the number of users in the system,
and the full symmetric matrix is employed for group
generation. A single row or column corresponds to a single user,
and a binary cell value represents whether the full set of
pairwise group descriptors holds between the respectively
paired users.</p>
          <p>To populate the GTM, pairwise group descriptors are
evaluated across each user pair in a given single-user dataset.
The GTM enables efficient storage and operations for
testing candidate group composition. A simple lookup indicates
whether two users can group. A bitwise-AND operation on
those two user rows indicates which (and how many) other
users they can group with together. A further bitwise-AND
with a third user indicates which (and how many) other
users the three can group with together, and so on.
Composing such row- (or column-) wise operations provides an
efficient foundation for a generate-and-test approach to
creating candidate groups from pairwise group descriptors.</p>
          <p>3.2 Group Formation</p>
          <p>Once the group model is constructed, it can be applied
to generate groups from any common CF user-rating data
model as the underlying data source. The group formation
mechanism applies the set of group descriptors to
generate synthetic groups that are valid for the group model. It
conducts an exhaustive search through the space of
potential groups, employing heuristic pruning to limit the number
of groups considered. Initially, individual users are filtered
based on group descriptors that can be applied to single
users (e.g., minimum number of items rated). The GTM
is generated for the remaining users. Baseline pairwise group
descriptors are then used to eliminate some individual users
from further consideration (e.g., minimum group size). The
GTM is used to generate-and-test candidate groups for a
given group size.</p>
          <p>To address the issue of modeling actual group preferences
for evaluating system predictions, the framework is tuned to
identify groups where all group members gave at least one
co-rated item the exact same rating. Such identified “test items”
become candidates for the
testing set in the evaluation process in conjunction with the
corresponding group. We note that there are many
potential approaches to model agreement among group members.
In this implementation we choose the most straightforward
approach, where the average rating among group members
is equal to the individual group member rating for that item,
as a baseline for evaluation. We do not currently eliminate
“universally popular” items, but enough test items are
identified that we do not expect such items to make a significant
difference. A common practice in evaluation frameworks is
to divide data sets into test and target data sets. In this
framework the test data set for each group would consist of
the identified common item or items for that group.</p>
          <p>4. EVALUATION</p>
          <p>4.1 Baseline Collaborative Filtering</p>
          <p>
            We implement the most prevalent memory-based CF
algorithm, the neighborhood-based CF algorithm [
            <xref ref-type="bibr" rid="ref12 ref22">12, 22</xref>
            ]. The basis
for this algorithm is to calculate the similarity, w<sub>ab</sub>, which
reflects the correlation between two users a and b. We
measure this correlation by computing the Pearson correlation
defined as:
          </p>
          <disp-formula><label>(4)</label><tex-math>w_{ab} = \frac{\sum_{i=1}^{n} \left[ (r_{ai} - \bar{r}_a)(r_{bi} - \bar{r}_b) \right]}{\sqrt{\sum_{i=1}^{n} (r_{ai} - \bar{r}_a)^2 \sum_{i=1}^{n} (r_{bi} - \bar{r}_b)^2}}</tex-math></disp-formula>
          <p>To generate predictions, a subset of the nearest neighbors
of the active user is chosen based on their correlation.
We then calculate a weighted aggregate of their ratings
to generate predictions for that user. We use the following
formula to calculate the prediction of item i for user a:</p>
          <disp-formula><label>(5)</label><tex-math>p_{ai} = \bar{r}_a + \frac{\sum_{b=1}^{n} \left[ (r_{bi} - \bar{r}_b) \cdot w_{ab} \right]}{\sum_{b=1}^{n} w_{ab}}</tex-math></disp-formula>
          <p>
            Herlocker et al. [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ] noted that setting a maximum
neighborhood size of less than 20 negatively affects the
accuracy of the recommender system. They recommend setting
a maximum neighborhood size in the range of 20 to 60. We
set the neighborhood size to 50, and we also require 50 as the
minimum neighborhood size for each member of the groups we
considered for evaluation. Breese et al. [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ] reported that
neighbors with higher similarity correlation with the target
user can be exceptionally more valuable as predictors than
those with lower similarity values. We set this similarity threshold
to 0.5, and we only consider correlations based on 5 or more
co-rated items.
          </p>
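          <p>The Pearson similarity and the weighted-deviation prediction described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the evaluation: the dictionary-based ratings layout and the function names pearson and predict are hypothetical, the Pearson means are taken over co-rated items while the prediction uses each user's overall mean (one common convention), and the fallback to the user's own mean when no neighbor qualifies is our choice. The defaults mirror the settings stated in the text (50 neighbors, similarity threshold 0.5, at least 5 co-rated items).</p>
          <preformat><![CDATA[
```python
from math import sqrt


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def pearson(ra, rb, min_corated=5):
    """Pearson correlation w_ab over the items co-rated by users a and b.

    ra, rb: dicts mapping item -> rating. Returns None when fewer than
    min_corated items are shared or when either user has zero variance.
    """
    common = sorted(set(ra) & set(rb))
    if len(common) < min_corated:
        return None
    ma = mean(ra[i] for i in common)  # assumption: means over co-rated items
    mb = mean(rb[i] for i in common)
    num = sum((ra[i] - ma) * (rb[i] - mb) for i in common)
    den = sqrt(sum((ra[i] - ma) ** 2 for i in common)
               * sum((rb[i] - mb) ** 2 for i in common))
    return num / den if den > 0 else None


def predict(ratings, a, item, k=50, threshold=0.5, min_corated=5):
    """Prediction p_ai: mean rating of user a plus the similarity-weighted
    mean deviation of at most k qualifying neighbors who rated the item."""
    ra = ratings[a]
    neighbors = []
    for b, rb in ratings.items():
        if b == a or item not in rb:
            continue
        w = pearson(ra, rb, min_corated)
        if w is not None and w >= threshold:
            neighbors.append((w, b))
    neighbors.sort(reverse=True)      # most similar neighbors first
    neighbors = neighbors[:k]
    mean_a = mean(ra.values())
    if not neighbors:
        return mean_a                 # assumption: fall back to a's mean
    num = sum((ratings[b][item] - mean(ratings[b].values())) * w
              for w, b in neighbors)
    den = sum(w for w, _ in neighbors)  # positive, since every w >= threshold
    return mean_a + num / den
```
]]></preformat>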
          <p>4.2 Group Prediction Aggregation</p>
          <p>Previous group recommender research has focused on
several group aggregation strategies for combining individual
predictions. We evaluate the three group aggregation
strategies outlined in Section 2.1 as representative RGPMs.
We compare the performance of these three aggregation
strategies with respect to two group characteristics: group size and the
degree of similarity within the group.</p>
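          <p>As a minimal sketch of how the three RGPM strategies combine the individual predictions of group members into one group prediction (the function names and the strategy keys are illustrative assumptions, not part of the evaluated system):</p>
          <preformat><![CDATA[
```python
def average_strategy(predictions):
    """Group rating as the mean of the members' predicted ratings."""
    return sum(predictions) / len(predictions)


def least_misery(predictions):
    """Group rating as the lowest predicted rating among members."""
    return min(predictions)


def most_happiness(predictions):
    """Group rating as the highest predicted rating among members."""
    return max(predictions)


STRATEGIES = {
    "average": average_strategy,
    "least_misery": least_misery,
    "most_happiness": most_happiness,
}


def aggregate(predictions, strategy="average"):
    """Aggregate per-member predictions for one item into a group rating."""
    if not predictions:
        raise ValueError("a group needs at least one member prediction")
    return STRATEGIES[strategy](predictions)
```
]]></preformat>
          <p>For a group whose members have predicted ratings 4.0, 2.0, and 3.0 for an item, the three strategies yield 3.0, 2.0, and 4.0, respectively.</p>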
          <p>4.3 Data Set</p>
          <p>To evaluate the accuracy of an aggregated predicted
rating for a group, we use the MovieLens data set of 100K ratings
from 943 users. Simulated groups were created based on
different thresholds defined for the group descriptors. The two
descriptors we varied were group size and degree of
similarity among group members. We presume that the data set
used to create the simulated groups is the same data
set used to evaluate recommendation techniques.</p>
          <p>By varying the thresholds of the group descriptors used to
create the group threshold matrix we were able to represent
groups of different characteristics, which we then used to
find and generate groups for testing. One aspect we wanted
to investigate is the effect of group homogeneity and size on
the different aggregation methods used to predict a rating
score for a group using the baseline CF algorithm defined in
Section 4.1. To answer this question we varied the threshold
for the similarity descriptor and then varied the size of the
group from 2 to 5. We defined three similarity levels (high,
medium, and low), as outlined in Table 1,
where the inner similarity correlation between any two users
i, j belonging to group G is calculated using the Pearson
correlation defined in Section 4.1.</p>
          <p>To ensure significance of the calculated similarity
correlations, we only consider user pairs that have at least 5 commonly
rated items. For the MovieLens data set used, we have a total
of 444,153 distinct correlations (943 users taken two
at a time, i.e., 943 × 942 / 2). For the three similarity levels defined previously,
the total correlations and average correlation are outlined in
Table 2.</p>
          <p>Table 3 reflects the GTM group generation statistics for
the underlying data set used in our evaluation. The total
combinations field indicates the number of possible group
combinations that can be formed given user pairs that satisfy
our group size threshold descriptor. The valid groups field
indicates the number of possible groups that satisfy both
the size and similarity thresholds, whereas the testable groups
are valid groups with at least one identified test item as
described in Section 3.2. As we increase the size of the groups
to be created, the number of combinations the
implementation has to check increases significantly. We can also see
that the number of testable groups is large in comparison
to the number of groups used in actual user studies. As of
this writing, and due to system restrictions, we were able to
generate all testable groups for group sizes 2 and 3 across
all similarity levels, group size 4 for the low and high similarity
levels, and group size 5 for the high similarity level.</p>
          <p>4.4 The Testing Framework</p>
          <p>The framework creates a Group Threshold Matrix based
on the group descriptor conditions defined. In our
implementation of this framework, the group descriptors used to
define inputs for the group threshold matrix are the
user-user correlation and the number of co-rated items between
any user pair. This forms the group model element of the
testing framework. For the group formation element, we
varied the group size and, for each size, the similarity category;
5000 testable groups were identified (with at least one
common rating across group members). A predicted rating was
computed for each group member, and those values were
aggregated to produce a final group predicted rating. Table 3
gives an overview of the number of different group
combinations the framework needs to consider to identify valid
and testable groups. The framework searches the possible
combinations to identify groups where the defined group descriptors
hold between every user pair belonging to that
group, as recorded in the GTM.</p>
          <p>We then utilized the testing framework to assess the
predicted rating computed for a group based on the three
aggregation strategies defined in Section 4.2. We compared
the group predicted rating calculated for the test item to
the actual rating using MAE and RMSE across the different
aggregation methods.</p>
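          <p>For reference, the two error measures can be computed as in the following sketch, where each position pairs a group's aggregated prediction for its test item with the rating on which all members agree. The function names and the list-based inputs are assumptions made for illustration.</p>
          <preformat><![CDATA[
```python
from math import sqrt


def mae(predicted, actual):
    """Mean absolute error between paired predictions and actual ratings."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def rmse(predicted, actual):
    """Root mean squared error; penalizes large errors more than MAE."""
    return sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual))
                / len(actual))
```
]]></preformat>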
          <p>It is worth noting here that, just as the quality of any
recommendation technique depends on the quality of the input data,
the quality of the generated test set depends on the quality of
the underlying individual ratings data set when it comes to
the ability to generate predictions. For example, prediction
accuracy and quality decrease due to sparsity in the original
data set.
</p>
          <p>5. RESULTS AND DISCUSSION</p>
          <p>Our evaluation goal is to test group recommendation based
on traditional memory-based collaborative filtering techniques,
in order to provide a basis of comparison that covers (1)
synthetic group formation for this type of approach, and (2)
group evaluation based on prediction rather than ranking.
We hypothesize that aggregation results will support
previous research for the aggregation strategies tested. In doing
so, we investigate the relationship between the group’s
coherence, size and the aggregation strategy used. Figures
1-6 reflect the MAE and RMSE for these evaluated
relationships (similarity levels: high &gt;= 0.5; medium &gt;= 0 and &lt; 0.5).
Examining the graphs for the groups with high
similarity levels, Figures 1 and 2 show that average
strategy and most happiness perform better than least misery.
We conducted a t-test to evaluate the significance of the results
and found that, for both MAE and RMSE, the average and most
happiness strategies significantly outperform the least misery
strategy across all group sizes (p&lt;0.001). For group sizes
2 and 3 there was no significant difference between the
average and most happiness strategies (p&gt;0.01). For group
sizes 4 and 5 the most happiness strategy performs better than
the average strategy (p&lt;0.001). The performance of both the least
misery and average strategies decreases as the group size
grows. This indicates that a larger group of highly similar
people is as happy as its happiest member.</p>
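          <p>The text does not state which t-test variant was used. As a hedged illustration only, the Python sketch below assumes a paired test over per-group absolute errors, since both strategies are scored on the same test groups. The error values and the function name paired_t_statistic are hypothetical.</p>
          <preformat>
```python
import math
from statistics import mean, stdev


def paired_t_statistic(errors_a, errors_b):
    """Return (t, degrees of freedom) for paired samples of equal length."""
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    # Standard error of the mean difference uses the sample standard deviation.
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Hypothetical per-group absolute errors for two strategies on the same groups.
least_misery_err = [1.0, 0.5, 1.5, 1.0, 2.0]
average_err = [0.5, 0.5, 1.0, 0.0, 1.0]

t_stat, dof = paired_t_statistic(least_misery_err, average_err)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof}")
```
</preformat>
          <p>Converting the statistic into a p-value requires the t distribution with the reported degrees of freedom; a statistics library routine such as scipy.stats.ttest_rel performs both steps for paired samples.</p>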
          <p>Figures 3 and 4 show the RMSE and MAE for groups with
medium similarity levels. The average strategy performs
significantly better than most happiness and least misery
across group sizes 2, 3 and 4 (p&lt;0.001). For the groups
of size 5 there was no significant difference between the
average and most happiness strategies (p&gt;0.01). For groups
with medium similarity levels, the performance of the least misery
strategy is similar to that for the groups with high coherency levels.</p>
          <p>Figures 5 and 6 show the results for the groups with
low similarity levels. Examining the RMSE and MAE in
these graphs, the average strategy performs best across all
group sizes compared to the other two strategies. For all group
sizes with low coherency, the MAE and RMSE of the average
strategy differed significantly (p&lt;0.001)
from those of both the least misery and most happiness
strategies. In contrast to the groups with high coherency, for
groups with low coherency the most happiness performance
starts to decrease as the group size increases, while the
performance of the least misery strategy starts to increase.</p>
          <p>These evaluation results indicate that in situations where
groups are formed with highly similar members, the most
happiness aggregation strategy would be best to model the RGPM,
while for groups with medium to low coherency the average
strategy would be best. These results, obtained using the 5000
synthesized groups for each category, coincide with the results
reported by Gartrell et al. using real subjects. Gartrell et al. defined
groups based on the social relationships between the group
members. They identified three levels of social relationships
(couple, acquaintance and first-acquaintance) that might
exist between group members. In their study comparing the
performance of the three aggregation strategies across these
social ties, they reported that for the groups of two members
with a social tie defined as couple, the most happiness
strategy outperforms the other two. For the acquaintance groups,
which had 3 members, the average strategy performs
best, while for the first-acquaintance category, which had one group
with 12 members, the least misery strategy performs
best. Their results for the couple groups
thus correspond to our high-coherency groups, the
acquaintance groups map to the medium-coherency groups,
and the first-acquaintance group follows the low-coherency
groups. Masthoff's studies reported that people usually used
the average and least misery strategies since they valued fairness
and preventing misery. It is worth noting that her studies
evaluated these strategies for groups of size 3 only, without
any reference to coherency levels.</p>
          <p>CONCLUSION</p>
          <p>As group-based recommender systems become more
prevalent, there is an increasing need for evaluation approaches
and data sets to enable more extensive analysis of such
systems. In this paper we developed a group testing framework
that can help address the problem by automating group
formation, resulting in the generation of groups applicable for
testing in this domain. Our work provides novel
coverage in the group recommender evaluation space, in that it
(1) focuses on traditional memory-based collaborative
filtering, and (2) employs precise overlap across individual user
ratings for evaluating actual group preference. We
evaluated our framework with a foundational collaborative
filtering neighborhood-based approach, prediction accuracy,
and three representative group prediction aggregation
strategies. Our results show that for small groups with
high similarity among their members, the average and most happiness
strategies perform the best. For larger groups with high similarity,
most happiness performs better. For the low and
medium similarity groups, the average strategy has the best
performance. Overall, this work has helped to extend the
coverage of group recommender evaluation analysis, and we
expect this will provide a novel point of comparison for further
developments in this area. Going forward, we plan to
evaluate various parameterizations of our testing framework, such
as more flexible AGPM metrics (e.g., normalizing the ratings
of the individual users).</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Amer-Yahia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. B.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chawla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Das</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Yu</surname>
          </string-name>
          .
          <article-title>Group recommendation: Semantics and efficiency</article-title>
          .
          <source>Proceedings of the VLDB Endowment</source>
          ,
          <volume>2</volume>
          :
          <fpage>754</fpage>
          -
          <lpage>765</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ardissono</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Goy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Petrone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Segnan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Torasso</surname>
          </string-name>
          .
          <article-title>Intrigue: Personalized recommendation of tourist attractions for desktop and hand held devices</article-title>
          .
          <source>Applied Artificial Intelligence</source>
          , pages
          <fpage>687</fpage>
          -
          <lpage>714</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Baccigalupo</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Plaza</surname>
          </string-name>
          .
          <article-title>Poolcasting: A social web radio architecture for group customisation</article-title>
          .
          <source>In Proceedings of the Third International Conference on Automated Production of Cross Media Content for Multi-Channel Distribution</source>
          , pages
          <fpage>115</fpage>
          -
          <lpage>122</lpage>
          , Washington, DC, USA,
          <year>2007</year>
          . IEEE Computer Society.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Baltrunas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Makcinskas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Ricci</surname>
          </string-name>
          .
          <article-title>Group recommendations with rank aggregation and collaborative filtering</article-title>
          .
          <source>In Proceedings of the fourth ACM conference on Recommender systems, RecSys '10</source>
          , pages
          <fpage>119</fpage>
          -
          <lpage>126</lpage>
          , New York, NY, USA,
          <year>2010</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Berkovsky</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Freyne</surname>
          </string-name>
          .
          <article-title>Group-based recipe recommendations: analysis of data aggregation strategies</article-title>
          .
          <source>In Proceedings of the fourth ACM conference on Recommender systems, RecSys '10</source>
          , pages
          <fpage>111</fpage>
          -
          <lpage>118</lpage>
          , New York, NY, USA,
          <year>2010</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Breese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Heckerman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Kadie</surname>
          </string-name>
          .
          <article-title>Empirical analysis of predictive algorithms for collaborative filtering</article-title>
          . In G. F. Cooper and S. Moral, editors,
          <source>Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence</source>
          , pages
          <fpage>43</fpage>
          -
          <lpage>52</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D. L.</given-names>
            <surname>Chao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Balthrop</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Forrest</surname>
          </string-name>
          .
          <article-title>Adaptive radio: achieving consensus using negative preferences</article-title>
          .
          <source>In Proceedings of the 2005 international ACM SIGGROUP conference on Supporting group work</source>
          ,
          <source>GROUP '05</source>
          , pages
          <fpage>120</fpage>
          -
          <lpage>123</lpage>
          , New York, NY, USA,
          <year>2005</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.-L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-C.</given-names>
            <surname>Cheng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.-N.</given-names>
            <surname>Chuang</surname>
          </string-name>
          .
          <article-title>A group recommendation system with consideration of interactions among group members</article-title>
          .
          <source>Expert Syst. Appl.</source>
          ,
          <volume>34</volume>
          :
          <fpage>2082</fpage>
          -
          <lpage>2090</lpage>
          ,
          <year>April 2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Crossen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Budzik</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K. J.</given-names>
            <surname>Hammond</surname>
          </string-name>
          .
          <article-title>Flytrap: intelligent group music recommendation</article-title>
          .
          <source>In Proceedings of the 7th international conference on Intelligent user interfaces</source>
          ,
          <source>IUI '02</source>
          , pages
          <fpage>184</fpage>
          -
          <lpage>185</lpage>
          , New York, NY, USA,
          <year>2002</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Gartrell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Lv</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Beach</surname>
          </string-name>
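          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mishra</surname>
          </string-name>
          , and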
          <string-name>
            <given-names>K.</given-names>
            <surname>Seada</surname>
          </string-name>
          .
          <article-title>Enhancing group recommendation by incorporating social relationship interactions</article-title>
          .
          <source>In Proceedings of the 16th ACM international conference on Supporting group work</source>
          ,
          <source>GROUP '10</source>
          , pages
          <fpage>97</fpage>
          -
          <lpage>106</lpage>
          , New York, NY, USA,
          <year>2010</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D.</given-names>
            <surname>Goren-Bar</surname>
          </string-name>
          and
          <string-name>
            <given-names>O.</given-names>
            <surname>Glinansky</surname>
          </string-name>
          .
          <article-title>FIT-recommending TV programs to family members</article-title>
          .
          <source>Computers &amp; Graphics</source>
          ,
          <volume>28</volume>
          (
          <issue>2</issue>
          ):
          <fpage>149</fpage>
          -
          <lpage>156</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Herlocker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Konstan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Riedl</surname>
          </string-name>
          .
          <article-title>An empirical analysis of design choices in neighborhood-based collaborative filtering algorithms</article-title>
          .
          <source>Inf. Retr.</source>
          ,
          <volume>5</volume>
          :
          <fpage>287</fpage>
          -
          <lpage>310</lpage>
          ,
          <year>October 2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Jameson</surname>
          </string-name>
          .
          <article-title>More than the sum of its members: challenges for group recommender systems</article-title>
          .
          <source>In Proceedings of the working conference on Advanced visual interfaces</source>
          ,
          <source>AVI '04</source>
          , pages
          <fpage>48</fpage>
          -
          <lpage>54</lpage>
          , New York, NY, USA,
          <year>2004</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Jameson</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Smyth</surname>
          </string-name>
          .
          <article-title>The adaptive web</article-title>
          . chapter Recommendation to groups, pages
          <fpage>596</fpage>
          -
          <lpage>627</lpage>
          . Springer-Verlag, Berlin, Heidelberg,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Masthoff</surname>
          </string-name>
          .
          <article-title>Group modeling: Selecting a sequence of television items to suit a group of viewers</article-title>
          .
          <source>User Modeling and User-Adapted Interaction</source>
          ,
          <volume>14</volume>
          :
          <fpage>37</fpage>
          -
          <lpage>85</lpage>
          ,
          <year>February 2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Masthoff</surname>
          </string-name>
          .
          <article-title>Group recommender systems: Combining individual models</article-title>
          . In F. Ricci,
          <string-name>
            <given-names>L.</given-names>
            <surname>Rokach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Shapira</surname>
          </string-name>
          , and P. B. Kantor, editors,
          <source>Recommender Systems Handbook</source>
          , pages
          <fpage>677</fpage>
          -
          <lpage>702</lpage>
          . Springer US,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J. F.</given-names>
            <surname>McCarthy</surname>
          </string-name>
          .
          <article-title>Pocket restaurantfinder: A situated recommender system for groups</article-title>
          . pages
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>J. F.</given-names>
            <surname>McCarthy</surname>
          </string-name>
          and
          <string-name>
            <given-names>T. D.</given-names>
            <surname>Anagnost</surname>
          </string-name>
          .
          <article-title>Musicfx: an arbiter of group preferences for computer supported collaborative workouts</article-title>
          .
          <source>In CSCW, page 348</source>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>K.</given-names>
            <surname>McCarthy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Salamó</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Coyle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>McGinty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Smyth</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Nixon</surname>
          </string-name>
          .
          <article-title>CATS: A synchronous approach to collaborative group recommendation</article-title>
          . pages
          <fpage>86</fpage>
          -
          <lpage>91</lpage>
          , Melbourne Beach, Florida, USA,
          <year>2006</year>
          . AAAI Press.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>M.</given-names>
            <surname>O'Connor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cosley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Konstan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Riedl</surname>
          </string-name>
          .
          <article-title>PolyLens: a recommender system for groups of users</article-title>
          .
          <source>In Proceedings of the seventh conference on European Conference on Computer Supported Cooperative Work</source>
          , pages
          <fpage>199</fpage>
          -
          <lpage>218</lpage>
          , Norwell, MA, USA,
          <year>2001</year>
          . Kluwer Academic Publishers.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Recio-Garcia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Jimenez-Diaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Sanchez-Ruiz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Diaz-Agudo</surname>
          </string-name>
          .
          <article-title>Personality aware recommendations to groups</article-title>
          .
          <source>In Proceedings of the third ACM conference on Recommender systems, RecSys '09</source>
          , pages
          <fpage>325</fpage>
          -
          <lpage>328</lpage>
          , New York, NY, USA,
          <year>2009</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>P.</given-names>
            <surname>Resnick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Iacovou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Suchak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bergstrom</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Riedl</surname>
          </string-name>
          .
          <article-title>GroupLens: An open architecture for collaborative filtering of netnews</article-title>
          .
          <source>In 1994 ACM Conference on Computer Supported Collaborative Work Conference</source>
          , pages
          <fpage>175</fpage>
          -
          <lpage>186</lpage>
          , Chapel Hill, NC,
          <year>1994</year>
          . Association of Computing Machinery.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>C.</given-names>
            <surname>Senot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kostadinov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bouzid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Picault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Aghasaryan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bernier</surname>
          </string-name>
          .
          <article-title>Analysis of strategies for building group profiles</article-title>
          . In P. De Bra,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kobsa</surname>
          </string-name>
          , and D. Chin, editors,
          <source>User Modeling, Adaptation, and Personalization</source>
          , volume
          <volume>6075</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>40</fpage>
          -
          <lpage>51</lpage>
          . Springer Berlin / Heidelberg,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>D.</given-names>
            <surname>Sprague</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Tory</surname>
          </string-name>
          .
          <article-title>Music selection using the partyvote democratic jukebox</article-title>
          .
          <source>In Proceedings of the working conference on Advanced visual interfaces</source>
          ,
          <source>AVI '08</source>
          , pages
          <fpage>433</fpage>
          -
          <lpage>436</lpage>
          , New York, NY, USA,
          <year>2008</year>
          . ACM.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>