Evaluating Group Recommendation Strategies in Memory-Based Collaborative Filtering

Nadia A. Najjar and David C. Wilson
University of North Carolina at Charlotte
9201 University City Blvd., Charlotte, NC, USA
nanajjar@uncc.edu, davils@uncc.edu

ABSTRACT
Group recommendation presents significant challenges in evolving best practice approaches to group modeling, but even moreso in dataset collection for testing and in developing principled evaluation approaches across groups of users. Early research provided more limited, illustrative evaluations for group recommender approaches, but recent work has been exploring more comprehensive evaluative techniques. This paper describes our approach to evaluate group-based recommenders using data sets from traditional single-user collaborative filtering systems. The approach focuses on classic memory-based approaches to collaborative filtering, addressing constraints imposed by sparsity in the user-item matrix. In generating synthetic groups, we model 'actual' group preferences for evaluation by precise rating agreement among members. We evaluate representative group aggregation strategies in this context, providing a novel comparison point for earlier illustrative memory-based results and for more recent model-based work, as well as for models of actual group preference in evaluation.

Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous; D.2.8 [Software Engineering]: Metrics—complexity measures, performance measures

General Terms
Algorithms, Experimentation, Standardization

Keywords
Evaluation, Group recommendation, Collaborative filtering, Memory-based

1. INTRODUCTION
Recommender systems have traditionally focused on the individual user as a target for personalized information filtering. As the field has grown, increasing attention is being given to the issue of group recommendation [14, 4]. Group recommender systems must manage and balance preferences from individuals across a group of users with a common purpose, in order to tailor choices, options, or information to the group as a whole. Group recommendations can help to support a variety of tasks and activities across domains that have a social aspect with shared-consumption needs. Common examples arise in social entertainment: finding a movie or a television show for family night, date night, or the like [22, 11, 23]; finding a restaurant for dinner with work colleagues, family, or friends [17]; finding a dish to cook that will satisfy the whole group [5]; the book the book club should read next; the travel destination for the next family vacation [19, 2, 13]; or the songs to play at any social event or at any shared public space [24, 3, 7, 9, 18].

Group recommenders have been distinguished from single-user recommenders primarily by their need for an aggregation mechanism to represent the group. A considerable amount of research in group-based recommenders concentrates on the techniques used for a recommendation strategy, and two main group recommendation strategies have been proposed [14]. The first strategy merges the individual profiles of the group members into one group representative profile, while the second strategy merges the recommendation lists or predictions computed for each group member into one recommendation list presented to the group. Both strategies utilize recommendation approaches validated for individual users, leaving the aggregation strategy as a distinguishing area of study applicable for group-based recommenders.

Group recommendation presents significant challenges in evolving best practice approaches to group modeling, but even moreso in dataset collection for testing and in developing principled evaluation approaches across groups of users. Early research provided more limited, illustrative evaluations for group recommender approaches (e.g., [18, 20, 17]), but recent work has been exploring more comprehensive evaluative techniques (e.g., [4, 8, 1]).
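The two aggregation architectures described above can be sketched in a few lines. This is an illustrative sketch only; the function names and the plain averaging are assumptions for exposition, not the paper's implementation:

```python
# Sketch of the two group recommendation architectures: (1) merge member
# profiles into one pseudo-user profile, vs. (2) merge per-member predictions.
# Ratings are dicts mapping item -> rating; averaging is used for simplicity.

def merge_profiles(profiles):
    """Strategy 1: build one group profile by averaging each member's
    rating for every item any member has rated."""
    merged = {}
    for profile in profiles:
        for item, rating in profile.items():
            merged.setdefault(item, []).append(rating)
    return {item: sum(rs) / len(rs) for item, rs in merged.items()}

def merge_predictions(predictions):
    """Strategy 2: compute predictions per member first, then aggregate
    the per-member predicted ratings (here: a plain average)."""
    items = set().union(*predictions)
    return {item: sum(p[item] for p in predictions if item in p) /
                  sum(1 for p in predictions if item in p)
            for item in items}

group = [{"m1": 4, "m2": 2}, {"m1": 5, "m3": 3}]
print(merge_profiles(group))  # {'m1': 4.5, 'm2': 2.0, 'm3': 3.0}
```

With identical inputs and an averaging rule the two routes coincide; they diverge once the single-user recommender transforms each profile before prediction, which is why the aggregation strategy is a distinguishing design choice.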
Broadly, evaluations have been conducted either via live user studies or via synthetic dataset analysis. In both types of evaluation, determining an overall group preference to use as ground truth in measuring recommender accuracy presents a complementary aggregation problem to group modeling for generating recommendations. Based on group interaction and group choice outcomes, either a gestalt decision is rendered for the group as a whole, or individual preferences are elicited and combined to represent the overall group preference. The former lends itself to user studies in which the decision emerges from group discussion and interaction, while the latter lends itself to synthetic group analysis. Currently, the limited deployment of group recommender systems, coupled with the additional overhead of bringing groups together for user studies, has constrained the availability of data sets that can be used to evaluate group-based recommenders. Thus, as with other group evaluation efforts [4], we adopt the approach of generating synthetic groups for larger-scale evaluation.

It is important to note that there are two distinct group modeling issues at play. The first is how to model a group for the purpose of making recommendations (i.e., what a group's preference outcome will be). We refer to this as the recommendation group preference model (RGPM). The second is how to determine an "actual" group preference based on outcomes in user data, in order to represent ground truth for evaluation purposes (i.e., what a group's preference outcome was). We refer to this as the actual group preference model (AGPM). For example, it might be considered a trivial recommendation if each group member had previously given a movie the same strong rating across the board. However, such an agreement point is ideal for evaluating whether that movie should have been recommended for the group.

In evaluating group-based recommenders, the primary context includes choices made about:

• the underlying recommendation strategy (e.g., content-based, collaborative memory-based or model-based)
• group modeling for making recommendations, i.e., the RGPM (e.g., least misery)
• determining actual group preferences for evaluative comparison to system recommendations, i.e., the AGPM (e.g., choice aggregation)
• choices about metrics for assessment (e.g., ranking, rating value)

2. RELATED WORK
Previous research that involves evaluation of group recommendation approaches falls into two primary categories. The first category employs synthetic datasets, generated from existing single-user datasets (typically MovieLens, www.movielens.org). The second category focuses on user studies.

2.1 Group Aggregation Strategies
Various group modeling strategies for making recommendations have been proposed and tested to aggregate the individual group members' preferences into a recommendation for the group. Masthoff [16] evaluated eleven strategies inspired by social choice theory. Three representative strategies are the average strategy, least misery, and most happiness.

• Average Strategy: this is the basic group aggregation strategy that assumes equal influence among group members and calculates the average rating of the group members for any given item as the predicted rating. Let n be the number of users in a group and r_{ij} be the rating of user j for item i; then the group rating for item i is computed as follows:

  $Gr_i = \frac{\sum_{j=1}^{n} r_{ij}}{n}$    (1)

• Least Misery Strategy: this aggregation strategy is applicable in situations where the recommender system needs to avoid presenting an item that was really disliked by any of the group members; i.e., the goal is to please the least happy member.
The predicted rating is calculated as the lowest rating for any given item among group members, computed as follows:

  $Gr_i = \min_j r_{ij}$    (2)

• Most Happiness: this aggregation strategy is the opposite of the least misery strategy. It applies in situations where the group is as happy as its happiest member, computed as follows:

  $Gr_i = \max_j r_{ij}$    (3)

Exploring the group recommendation space involves evaluation across a variety of such contexts. To date, we are not aware of a larger-scale group recommender evaluation using synthetic data sets that (1) focuses on traditional memory-based collaborative filtering or (2) employs precise overlap across individual user ratings for evaluating actual group preference. Given the foundational role of classic user-based [22] collaborative filtering in recommender systems, we are interested in understanding the behavior of group recommendation in this context as a comparative baseline for evaluation. Given that additional inference to determine "ground truth" preference for synthetic groups can potentially decrease precision in evaluation, we are interested in comparing results when group members agree precisely in the original ratings data.

In this paper, we focus on traditional memory-based approaches to collaborative filtering, addressing constraints imposed by sparsity in the user-item matrix. In generating valid synthetic groups, we model actual group preferences by direct rating agreement among members. Prediction accuracy is measured using root mean squared error and mean average error. We evaluate the performance of three representative group aggregation strategies (average, least misery, most happiness) [15] in this context, providing a novel comparison point for earlier illustrative memory-based results, for more recent model-based work, and for models of actual group preference in evaluation. This paper is organized as follows: Section 2 overviews related research. Section 3 outlines our group testing framework. Section 4 provides the evaluation using the proposed framework. Finally, Section 5 outlines our results and discussion.

2.2 Evaluation with Synthetic Groups
Recent work by Baltrunas et al. [4] used simulated groups to compare aggregation strategies for ranked lists produced by a model-based collaborative filtering methodology using matrix factorization with gradient descent (SVD). This approach addresses sparsity issues for user similarity. The MovieLens data set was used to simulate groups of different sizes (2, 3, 4, 8) and different degrees of similarity (high, random). They employed a ranking evaluation metric, measuring the effectiveness of the predicted rank list using Normalized Discounted Cumulative Gain (nDCG). To account for the sparsity in the rating matrix, nDCG was computed only over the items that appeared in the target user's test set. The effectiveness of the group recommendation was measured as the average effectiveness (nDCG) of the group members, where a higher nDCG indicated better performance.

Chen et al. [8] also used simulated groups and addressed the sparsity in the user-rating matrix by predicting the missing ratings of items belonging to the union set of items rated by group members. They simulated 338 random groups from the MovieLens data set and used them to evaluate the use of genetic algorithms that exploit single-user ratings, as well as item ratings given by groups, to model group interactions and find suitable items that can be considered neighbors in their implemented neighborhood-based CF.

In [5], the aggregated predictions strategy combined the predictions produced for each of the group members into one prediction using a weighted, linear combination of these predictions. Evaluation consisted of 170 users, where 108 of them belonged to a family group with sizes ranging between 1 and 4.

2.4 Establishing Group Preference
Amer-Yahia et al. [1] also simulated groups from MovieLens. The simulated groups were used to measure the performance of different strategies centered around a top-k TA algorithm. To generate groups, a similarity level was specified, and groups were formed from users that had a similarity value within a 0.05 margin. They varied the group similarity between 0.3, 0.5, 0.7 and 0.9, and the group size between 3, 5 and 8. It was unclear how actual group ratings were established for the simulated groups, or how many groups were created.

A major question that must be addressed in evaluating group recommender systems is how to establish the actual group preference in order to compare accuracy with system predictions. Previous work by [4, 8, 1] simulated groups from single-user data sets. Their simulated group creation was limited to groups of different sizes (representing small, medium and large) with certain degrees of similarity (random, homogeneous and heterogeneous). Chen et al. [8] used a baseline aggregation as the ground truth, while [4] compares the effectiveness of the group-based recommendation to the effectiveness of the individual recommendations made to each member of the group. This led to our work in investigating ways to create synthesized groups from the most commonly used single-user CF data sets, taking into consideration the ability to identify and establish ground truth. We propose a novel Group Testing Framework that allows for the creation of synthesized groups that can be used for testing in memory-based CF recommenders. In the remainder of the paper we give an overview of our proposed Group Testing Framework and report on the evaluations we conducted using this framework.

2.3 Evaluation with User Studies
Masthoff [15] employed user studies, not to evaluate specific techniques, but to determine which group aggregation strategies people actually use. Thirty-nine human subjects were given the same individual rating sets from three people on a collection of video clips. Subjects were asked to decide which clips the group should see given time limitations for viewing only 1, 2, 3, 4, 5, 6, or 7 clips, respectively; in addition, they were asked why they made that selection. Results indicated that people particularly use the following strategies: Average, Average Without Misery, and Least Misery.

PolyLens [20] evaluated qualitative feedback and changes in user behavior for a basic Least Misery aggregation strategy. Results showed that while users liked and used group recommendation, they disliked the minimize-misery strategy. They attributed this to the fact that this social value function is more applicable to groups of smaller sizes.

Amer-Yahia et al. [1] also ran a user study using Amazon's Mechanical Turk: they had a total of 45 users, and various groups were formed of sizes 3 and 8 to represent small and large groups. They established an evaluation baseline by generating a recommendation list using four implemented strategies. The resulting lists were combined into a single group list of distinct items and presented to the users for evaluation, where a relevance score of 1 was given if the user considered the item suitable for the group and 0 otherwise.

Overall, larger-scale synthetic evaluations for group recommendation have not focused on traditional memory-based approaches. This may be because it is cumbersome to address group generation, given sparsity constraints in the user-item matrix. Moreover, only limited attention has been given to evaluation based on predictions, rather than ranking. Our evaluation approach addresses these issues.

3. GROUP TESTING FRAMEWORK
We have developed a group testing framework in order to support evaluation of group recommender approaches. The framework is used to generate synthetic groups that are parametrized to test different group contexts. This enables exploration of various parameters, such as group diversity. The testing framework consists of two main components.
The first component is a group model that defines specific group characteristics, such as group coherence. The second component is a group formation mechanism that applies the model to identify compatible groups from an underlying single-user data set, according to outcome parameters such as the number of groups to generate.

Amer-Yahia et al. [1] employed an nDCG measure to evaluate their proposed prediction-list consensus function. The nDCG measure was computed for each group member, and the average was considered the effectiveness of the group recommendation.

Other work considers social relationships and interactions among group members when aggregating the predictions [10, 8, 21]. These approaches model member interactions, social relationships, domain expertise, and dissimilarity among the group members when choosing a group decision strategy. For example, Recio-Garcia et al. [21] described a group recommender system that takes into account the personality types of the group members.

Berkovsky and Freyne [5] reported better performance in the recipe recommendation domain when aggregating the user profiles rather than aggregating individual user predictions. They implemented a memory-based recommendation approach comparing the performance of four recommendation strategies, including aggregated models and aggregated predictions.

3.1 Group Model
In simulating groups of users, a given group will be defined based on certain constraints and characteristics, or group model. For example, we might want to test recommendations based on different levels of intra-group similarity or diversity. For a given dataset, the group model defines the space of potential groups for evaluation. While beyond the scope of this paper, we note that the group model for evaluation could include inter-group constraints (diversity across groups) as well as intra-group constraints (similarity within groups).

3.1.1 Group Descriptors
Gartrell et al. [10] use the term "group descriptors" for specific individual group characteristics (social, expertise, dissimilarity) to be accounted for within a group model. We adopt the group descriptor convention to refer to any quantifiable group characteristic that can reflect group structure and formation. Some group descriptors that can reflect group structure are user-user correlation, the number of co-rated items between users, and demographics such as age difference. We use these group descriptors to identify relationships between user pairs within a single-user data set.

3.1.2 Group Threshold Matrix
A significant set of typical group descriptors can be evaluated on a pairwise basis between group members. For example, group coherence can be defined as a minimum degree of similarity between group members, or a minimum number of commonly rated items. We employ such pairwise group descriptors as a foundational element in generating candidate groups for evaluation.

We note that there are many potential approaches to model agreement among group members. In this implementation we choose the most straightforward approach, where the average rating among group members is equal to each individual group member's rating for that item, as a baseline for evaluation. We do not currently eliminate "universally popular" items, but enough test items are identified that we do not expect such items to make a significant difference. A common practice in evaluation frameworks is to divide data sets into test and target data sets; in this framework, the test data set for each group consists of the identified common item or items for that group.

4. EVALUATION

4.1 Baseline Collaborative Filtering
We implement the most prevalent memory-based CF algorithm, the neighborhood-based CF algorithm [12, 22].
The basis for this algorithm is to calculate the similarity w_{ab}, which reflects the correlation between two users a and b. We measure this correlation by computing the Pearson correlation, defined as:

  $w_{ab} = \frac{\sum_{i=1}^{n}(r_{ai} - \bar{r}_a)(r_{bi} - \bar{r}_b)}{\sqrt{\sum_{i=1}^{n}(r_{ai} - \bar{r}_a)^2 \sum_{i=1}^{n}(r_{bi} - \bar{r}_b)^2}}$    (4)

To generate predictions, a subset of the nearest neighbors of the active user is chosen based on their correlation. We then calculate a weighted aggregate of their ratings to generate predictions for that user. We use the following formula to calculate the prediction of item i for user a:

  $p_{ai} = \bar{r}_a + \frac{\sum_{b=1}^{n}(r_{bi} - \bar{r}_b)\, w_{ab}}{\sum_{b=1}^{n} w_{ab}}$    (5)

Herlocker et al. [12] noted that setting a maximum neighborhood size of less than 20 negatively affects the accuracy of the recommender system.

We operationalize the pairwise group descriptors in a binary matrix data structure, referred to as the Group Threshold Matrix (GTM). The GTM is a square n × n symmetric matrix, where n is the number of users in the system, and the full symmetric matrix is employed for group generation. A single row or column corresponds to a single user, and a binary cell value represents whether the full set of pairwise group descriptors holds between the respectively paired users. To populate the GTM, pairwise group descriptors are evaluated across each user pair in a given single-user dataset. The GTM enables efficient storage and operations for testing candidate group composition. A simple lookup indicates whether two users can group. A bitwise-AND operation on those two user rows indicates which (and how many) other users they can group with together. A further bitwise-AND with a third user's row indicates which (and how many) other users the three can group with together, and so on. Composing such row- (or column-) wise operations provides an efficient foundation for a generate-and-test approach to creating candidate groups from pairwise group descriptors.
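The row-wise bitwise-AND scheme can be illustrated concretely with Python integers as bit rows. The toy descriptor and all names here are illustrative assumptions, not the paper's implementation:

```python
# Sketch of the Group Threshold Matrix (GTM) idea: one integer per user,
# with bit v set iff all pairwise group descriptors hold for the pair (u, v).

def build_gtm(n_users, descriptor_holds):
    """Populate the GTM by evaluating the pairwise descriptor predicate."""
    rows = [0] * n_users
    for u in range(n_users):
        for v in range(n_users):
            if u != v and descriptor_holds(u, v):
                rows[u] |= 1 << v
    return rows

def candidates(rows, members):
    """Bitwise-AND the members' rows: surviving bits mark users who can
    group with every current member (generate-and-test expansion step)."""
    acc = ~0
    for u in members:
        acc &= rows[u]
    for u in members:          # a member is not a candidate for its own group
        acc &= ~(1 << u)
    return [v for v in range(len(rows)) if acc >> v & 1]

# Toy descriptor: two users "can group" when their ids differ by at most 2.
rows = build_gtm(6, lambda u, v: abs(u - v) <= 2)
print(candidates(rows, [1, 2]))  # -> [0, 3]
```

Each AND narrows the candidate set, so candidate groups of size k+1 are generated only from surviving bits of a valid size-k group, which matches the heuristic pruning described for the group formation mechanism.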
Herlocker et al. recommend setting a maximum neighborhood size in the range of 20 to 60. We set the neighborhood size to 50; we also set that as the minimum neighborhood size for each member of the groups we considered for evaluation. Breese et al. [6] reported that neighbors with a higher similarity correlation with the target user can be exceptionally more valuable as predictors than those with lower similarity values. We set this threshold to 0.5, and we only consider correlations based on 5 or more co-rated items.

3.2 Group Formation
Once the group model is constructed, it can be applied to generate groups from any common CF user-rating data model as the underlying data source. The group formation mechanism applies the set of group descriptors to generate synthetic groups that are valid for the group model. It conducts an exhaustive search through the space of potential groups, employing heuristic pruning to limit the number of groups considered. Initially, individual users are filtered based on group descriptors that can be applied to single users (e.g., minimum number of items rated). The GTM is generated for the remaining users. Baseline pairwise group descriptors are then used to eliminate some individual users from further consideration (e.g., minimum group size). The GTM is used to generate-and-test candidate groups for a given group size.

4.2 Group Prediction Aggregation
Previous group recommender research has focused on several group aggregation strategies for combining individual predictions. We evaluate the three group aggregation strategies outlined in Section 2.1 as representative RGPMs. We compare the performance of these three aggregation strategies with respect to group characteristics: group size and the degree of similarity within the group.
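The baseline prediction (Equations 4 and 5) and the three aggregation strategies of Section 2.1 can be sketched end-to-end as follows. Names are illustrative; the prediction step normalizes by the sum of absolute weights, a common variant, and neighborhood-size and similarity thresholds are omitted for brevity:

```python
# Sketch: Pearson similarity (eq. 4), neighborhood prediction (eq. 5), and
# the three group aggregation strategies (eqs. 1-3) over member predictions.
from math import sqrt

def pearson(ra, rb):
    """Correlation over the items co-rated by users a and b (eq. 4)."""
    common = set(ra) & set(rb)
    if len(common) < 2:
        return 0.0
    ma = sum(ra[i] for i in common) / len(common)
    mb = sum(rb[i] for i in common) / len(common)
    num = sum((ra[i] - ma) * (rb[i] - mb) for i in common)
    den = sqrt(sum((ra[i] - ma) ** 2 for i in common) *
               sum((rb[i] - mb) ** 2 for i in common))
    return num / den if den else 0.0

def predict(a, item, ratings):
    """Weighted deviation-from-mean prediction for user a on item (eq. 5)."""
    ma = sum(ratings[a].values()) / len(ratings[a])
    num = den = 0.0
    for b, rb in ratings.items():
        if b == a or item not in rb:
            continue
        w = pearson(ratings[a], rb)
        mb = sum(rb.values()) / len(rb)
        num += (rb[item] - mb) * w
        den += abs(w)
    return ma + num / den if den else ma

def aggregate(preds, strategy):
    """Combine the members' predicted ratings into one group rating."""
    return {"average": sum(preds) / len(preds),      # eq. 1
            "least_misery": min(preds),              # eq. 2
            "most_happiness": max(preds)}[strategy]  # eq. 3

print(aggregate([4.0, 2.0, 3.0], "least_misery"))  # -> 2.0
```

A group prediction is then `aggregate([predict(u, item, ratings) for u in group], strategy)`, which is exactly the aggregated-predictions route evaluated here.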
To address the issue of modeling actual group preferences for evaluating system predictions, the framework is tuned to identify groups where all group members gave at least one co-rated item the exact same rating. Such identified "test items" become candidates for the testing set in the evaluation process, in conjunction with the corresponding group.

4.3 Data Set
To evaluate the accuracy of an aggregated predicted rating for a group, we use the MovieLens data set with 100K ratings and 943 users. Simulated groups were created based on different thresholds defined for the group descriptors. The two descriptors we varied were group size and degree of similarity among group members. We presume that the data set used to create the simulated groups is the same data set used to evaluate the recommendation techniques.

Table 1: Degrees of Group Similarity (for all i, j in G)
  High:    w_ij >= 0.5
  Medium:  0.5 > w_ij >= 0
  Low:     0 > w_ij

Table 2: Similarity Statistics for Test Data Set
  Degree of Similarity | Valid Correlations | Average User-User Similarity
  High                 | 39,650             | 0.65
  Medium               | 192,522            | 0.22
  Low                  | 95,739             | -0.25

Figure 1: RMSE - High degree of similarity.

By varying the thresholds of the group descriptors used to create the group threshold matrix, we were able to represent groups of different characteristics, which we then used to find and generate groups for testing. One aspect we wanted to investigate is the effect of group homogeneity and size on the different aggregation methods used to predict a rating score for a group using the baseline CF algorithms defined in Section 4.1. In our implementation of this framework, the group descriptors used to define inputs for the group threshold matrix are the user-user correlation and the number of co-rated items between any user pair. This forms the group model element of the testing framework. For the group formation element we varied the group size and, for each group size and similarity category,
5000 testable groups were identified (with at least one common rating across group members). To answer this question we varied the threshold for the similarity descriptor and then varied the size of the group from 2 to 5. We defined three similarity levels (high, medium and low similarity groups) as outlined in Table 1, where the inner similarity correlation between any two users i, j belonging to group G is calculated as defined in equation (4). A predicted rating was computed for each group member and those values were aggregated to produce a final group predicted rating.

To ensure significance of the calculated similarity correlations, we only consider user pairs that have at least 5 commonly rated items. For the MovieLens data set used, we have a total of 444,153 distinct correlations (943 users taken two at a time). For the three similarity levels defined previously, the total number of correlations and the average correlation are outlined in Table 2.

Table 3 gives an overview of the number of different group combinations the framework needs to consider to identify valid and testable groups. The framework exploits the possible combinations to identify groups where the defined group descriptors are valid between every user pair belonging to that group; this is then depicted in the GTM. Table 3 reflects the GTM group generation statistics for the underlying data set used in our evaluation. The total combinations field indicates the number of possible group combinations that can be formed given user pairs that satisfy our group size threshold descriptor. The valid groups field indicates the number of possible groups that satisfy both the size and similarity thresholds, whereas the testable groups are valid groups with at least one identified test item as described in Section 3.2. As we increase the size of the groups to be created, the number of combinations the implementation has to check increases significantly. We can also see that the number of testable groups is large in comparison to the number of groups used in actual user studies. As of this writing, and due to system restrictions, we were able to generate all testable groups for group sizes 2 and 3 across all similarity levels, group size 4 for the low and high similarity levels, and group size 5 for the high similarity level.

We then utilized the testing framework to assess the predicted rating computed for a group based on the three aggregation strategies defined in Section 4.2. We compared the group predicted rating calculated for the test item to the actual rating using MAE and RMSE across the different aggregation methods.

4.4 The Testing Framework
The framework creates a Group Threshold Matrix based on the group descriptor conditions defined.

It is worth noting here that, just as the quality of any recommendation technique depends on the quality of the input data, the quality of the generated test set depends on the quality of the underlying individual ratings data set when it comes to the ability to generate predictions. For example, prediction accuracy and quality decrease due to sparsity in the original data set.

5. RESULTS AND DISCUSSION
Our evaluation goal is to test group recommendation based on traditional memory-based collaborative filtering techniques, in order to provide a basis of comparison that covers (1) synthetic group formation for this type of approach, and (2) group evaluation based on prediction rather than ranking. We hypothesize that aggregation results will support previous research for the aggregation strategies tested. In doing so, we investigate the relationship between the group's coherence, size, and the aggregation strategy used.
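The two accuracy metrics used for the comparison can be sketched as follows, over (predicted, actual) rating pairs for the groups' test items; the values shown are illustrative:

```python
# Minimal sketch of the error metrics: mean average error (MAE) and
# root mean squared error (RMSE) over (predicted, actual) rating pairs.
from math import sqrt

def mae(pairs):
    return sum(abs(p - a) for p, a in pairs) / len(pairs)

def rmse(pairs):
    return sqrt(sum((p - a) ** 2 for p, a in pairs) / len(pairs))

pairs = [(3.5, 4.0), (2.0, 2.0), (4.5, 3.5)]
print(mae(pairs), rmse(pairs))  # -> 0.5 and ~0.645
```

RMSE penalizes large individual errors more heavily than MAE, which is why the two metrics are reported side by side in the figures below.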
Figures 1-6 reflect the MAE and RMSE for these evaluated relationships.

Table 3: Group Threshold Matrix Statistics (group sizes 2, 3, 4, 5)
  High similarity (w_ij >= 0.5)
    Total combinations: 39,650 | 1,351,657  | 40,435,741    | 1,087,104,263
    Valid groups:       39,650 | 226,952    | 417,948       | 390,854
    Testable groups:    37,857 | 129,826    | 129,851       | 71,441
  Medium similarity (0 <= w_ij < 0.5)
    Total combinations: 192,522 | 30,379,236 | 3,942,207,750 | 434,621,369,457
    Valid groups:       192,522 | 17,097,527 | n/a           | n/a
    Testable groups:    187,436 | 11,482,472 | n/a           | n/a
  Low similarity (w_ij < 0)
    Total combinations: 95,739 | 7,074,964  | 421,651,608   | 21,486,449,569
    Valid groups:       95,739 | 1,641,946  | 6,184,151     | n/a
    Testable groups:    87,642 | 470,257    | 283,676       | n/a

Figure 2: MAE - High degree of similarity.
Figure 3: RMSE - Medium degree of similarity.

Examining the graphs for the groups with high similarity levels, Figures 1 and 2 show that the average and most happiness strategies perform better than least misery. We conducted a t-test to evaluate the significance of the results and found that both MAE and RMSE for the average and most happiness strategies, across all group sizes, significantly outperform the least misery strategy (p < 0.001). For group sizes 2 and 3 there was no significant difference between the average and most happiness strategies (p > 0.01). For group sizes 4 and 5 the most happiness strategy performs better than the average strategy (p < 0.001). The performance of both the least misery and average strategies decreases as the group size grows. This indicates that a larger group of highly similar people is as happy as its happiest member.

Figures 3 and 4 show the RMSE and MAE for groups with medium similarity levels. The average strategy performs significantly better than most happiness and least misery across group sizes 2, 3 and 4 (p < 0.001). For the groups of size 5 there was no significant difference between the average and most happiness strategies (p > 0.01). For groups with a medium similarity level, the least misery strategy performance is similar to that for the groups with high coherency levels.
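The paper does not detail its t-test procedure; a minimal paired t-statistic over per-group error scores for two strategies might look like the following (the error values are synthetic, for illustration only):

```python
# Illustrative paired t-test sketch: compare two strategies' per-group errors.
from math import sqrt

def paired_t(xs, ys):
    """t-statistic for paired samples; compare against a t distribution
    with df = n - 1 to obtain a p-value."""
    d = [x - y for x, y in zip(xs, ys)]
    n = len(d)
    mean = sum(d) / n
    var = sum((v - mean) ** 2 for v in d) / (n - 1)
    return mean / sqrt(var / n)

avg_err = [0.81, 0.76, 0.90, 0.84, 0.79]  # hypothetical MAE per group, average
lm_err  = [0.95, 0.91, 1.02, 0.99, 0.93]  # hypothetical MAE per group, least misery
print(paired_t(avg_err, lm_err))  # strongly negative t favors the first strategy
```

In practice a library routine such as `scipy.stats.ttest_rel` computes the same statistic along with its p-value.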
Figure 4: MAE - Medium degree of similarity.
Figure 5: RMSE - Low degree of similarity.
Figure 6: MAE - Low degree of similarity.

Figures 5 and 6 show the results for the groups with a low similarity level. Examining the RMSE and MAE in these graphs, the average strategy performs best across all group sizes compared to the other two strategies. MAE and RMSE for the average strategy for all group sizes with low coherency had a statistically significant p-value (p < 0.001) compared to both the least misery and most happiness strategies. In contrast with the groups with high coherency, for groups with low coherency the most happiness performance starts to decrease as the group size increases, while the performance of the least misery strategy starts to increase.

These evaluation results indicate that in situations where groups are formed with highly similar members, the most happiness aggregation strategy would be best to model the RGPM, while for groups with medium to low coherency the average strategy would be best. These results, using the 5000 synthesized groups for each category, coincide with the results reported by Gartrell using real subjects.

Our work provides novel coverage in the group recommender evaluation space, considering (1) a focus on traditional memory-based collaborative filtering, and (2) precise overlap across individual user ratings for evaluating actual group preference. We evaluated our framework with a foundational collaborative filtering neighborhood-based approach, prediction accuracy, and three representative group prediction aggregation strategies. Our results show that for small groups with high similarity among their members, average and most happiness perform best. For larger groups with high similarity, most happiness performs better. For the low and
Gartrell defined medium similarity groups, average strategy has the best per- groups based on the social relationships between the group formance. Overall, this work has helped to extend the cov- members. They identified three levels of social relationships erage of group recommender evaluation analysis, and we ex- (couple, acquaintance and first-acquaintance) that might ex- pect this will provide a novel point of comparison for further ist between group members. In their study to compare the developments in this area. Going forward we plan to evalu- performance of the three aggregation strategies across these ate various parameterizations of our testing framework such social ties, they reported that for the groups of two members as more flexible AGPM metrics (e.g. normalizing the ratings with a social tie defined as couple the most happiness strat- of the individual users). egy outperforms the other two. For the acquaintance groups, these groups had 3 members, the average strategy performs 7. REFERENCES best, while for the first-acquaintance, they had one group [1] S. Amer-yahia, S. B. Roy, A. Chawla, G. Das, and with 12 members, the least misery strategy outperforms the C. Yu. Group recommendation: Semantics and best. It is apparent that their results for the couple groups efficiency. Proceedings of The Vldb Endowment, performance is equivalent to our high-coherency groups, the 2:754–765, 2009. acquaintance groups maps to the medium-coherency groups [2] L. Ardissono, A. Goy, G. Petrone, M. Segnan, and while the first-acquaintance groups follow the low-coherency P. Torasso. Intrigue: Personalized recommendation of groups. Mastho↵ studies reported that people usually used tourist attractions for desktop and hand held devices. average strategy and least misery since they valued fairness Applied Artificial Intelligence, pages 687–714, 2003. and preventing misery. It is worth noting that her studies [3] C. Baccigalupo and E. Plaza. 
Poolcasting: A social evaluated these strategies for groups of size 3 only without web radio architecture for group customisation. In any reference to coherency levels. Proceedings of the Third International Conference on Automated Production of Cross Media Content for 6. CONCLUSION Multi-Channel Distribution, pages 115–122, As group-based recommender systems become more preva- Washington, DC, USA, 2007. IEEE Computer Society. lent, there is an increasing need for evaluation approaches [4] L. Baltrunas, T. Makcinskas, and F. Ricci. Group and data sets to enable more extensive analysis of such sys- recommendations with rank aggregation and tems. In this paper we developed a group testing framework collaborative filtering. In Proceedings of the fourth that can help address the problem by automating group ACM conference on Recommender systems, RecSys formation resulting in generation of groups applicable for ’10, pages 119–126, New York, NY, USA, 2010. ACM. 49 [5] S. Berkovsky and J. Freyne. Group-based recipe B. Smyth, and P. Nixon. Cats: A synchronous recommendations: analysis of data aggregation approach to collaborative group recommendation. strategies. In Proceedings of the fourth ACM pages 86–91, Melbourne Beach, Florida, USA, conference on Recommender systems, RecSys ’10, 11/05/2006 2006. AAAI Press, AAAI Press. pages 111–118, New York, NY, USA, 2010. ACM. [20] M. O’Connor, D. Cosley, J. A. Konstan, and J. Riedl. [6] J. S. Breese, D. Heckerman, and C. M. Kadie. Polylens: a recommender system for groups of users. Empirical analysis of predictive algorithms for In Proceedings of the seventh conference on European collaborative filtering. In G. F. Cooper and S. Moral, Conference on Computer Supported Cooperative Work, editors, Proceedings of the 14th Conference on pages 199–218, Norwell, MA, USA, 2001. Kluwer Uncertainty in Artificial Intelligence, pages 43–52, Academic Publishers. 1998. [21] J. A. Recio-Garcia, G. Jimenez-Diaz, A. A. [7] D. L. Chao, J. 
Balthrop, and S. Forrest. Adaptive Sanchez-Ruiz, and B. Diaz-Agudo. Personality aware radio: achieving consensus using negative preferences. recommendations to groups. In Proceedings of the third In Proceedings of the 2005 international ACM ACM conference on Recommender systems, RecSys SIGGROUP conference on Supporting group work, ’09, pages 325–328, New York, NY, USA, 2009. ACM. GROUP ’05, pages 120–123, New York, NY, USA, [22] P. Resnick, N. Iacovou, M. Sushak, P. Bergstrom, and 2005. ACM. J. Riedl. Grouplens: An open architecture for [8] Y.-L. Chen, L.-C. Cheng, and C.-N. Chuang. A group collaborative filtering of netnews. In 1994 ACM recommendation system with consideration of Conference on Computer Supported Collaborative interactions among group members. Expert Syst. Work Conference, pages 175–186, Chapel Hill, NC, Appl., 34:2082–2090, April 2008. 10/1994 1994. Association of Computing Machinery, [9] A. Crossen, J. Budzik, and K. J. Hammond. Flytrap: Association of Computing Machinery. intelligent group music recommendation. In [23] C. Senot, D. Kostadinov, M. Bouzid, J. Picault, Proceedings of the 7th international conference on A. Aghasaryan, and C. Bernier. Analysis of strategies Intelligent user interfaces, IUI ’02, pages 184–185, for building group profiles. In P. De Bra, A. Kobsa, New York, NY, USA, 2002. ACM. and D. Chin, editors, User Modeling, Adaptation, and [10] M. Gartrell, X. Xing, Q. Lv, A. Beach, R. Han, Personalization, volume 6075 of Lecture Notes in S. Mishra, and K. Seada. Enhancing group Computer Science, pages 40–51. Springer Berlin / recommendation by incorporating social relationship Heidelberg, 2010. interactions. In Proceedings of the 16th ACM [24] D. Sprague, F. Wu, and M. Tory. Music selection international conference on Supporting group work, using the partyvote democratic jukebox. In GROUP ’10, pages 97–106, New York, NY, USA, Proceedings of the working conference on Advanced 2010. ACM. 
visual interfaces, AVI ’08, pages 433–436, New York, [11] D. Goren-Bar and O. Glinansky. Fit-recommend ing NY, USA, 2008. ACM. tv programs to family members. Computers & Graphics, 28(2):149 – 156, 2004. [12] J. Herlocker, J. A. Konstan, and J. Riedl. An empirical analysis of design choices in neighborhood-based collaborative filtering algorithms. Inf. Retr., 5:287–310, October 2002. [13] A. Jameson. More than the sum of its members: challenges for group recommender systems. In Proceedings of the working conference on Advanced visual interfaces, AVI ’04, pages 48–54, New York, NY, USA, 2004. ACM. [14] A. Jameson and B. Smyth. The adaptive web. chapter Recommendation to groups, pages 596–627. Springer-Verlag, Berlin, Heidelberg, 2007. [15] J. Mastho↵. Group modeling: Selecting a sequence of television items to suit a group of viewers. User Modeling and User-Adapted Interaction, 14:37–85, February 2004. [16] J. Mastho↵. Group recommender systems: Combining individual models. In F. Ricci, L. Rokach, B. Shapira, and P. B. Kantor, editors, Recommender Systems Handbook, pages 677–702. Springer US, 2011. [17] J. F. McCarthy. Pocket restaurantfinder: A situated recommender system for groups. pages 1–10, 2002. [18] J. F. McCarthy and T. D. Anagnost. Musicfx: an arbiter of group preferences for computer supported collaborative workouts. In CSCW, page 348, 2000. [19] K. McCarthy, M. SalamÃş, L. Coyle, L. McGinty, 50