Case Study Evaluation of Mahout as a Recommender Platform

Carlos E. Seminario
Software and Information Systems Dept.
University of North Carolina Charlotte
cseminar@uncc.edu

David C. Wilson
Software and Information Systems Dept.
University of North Carolina Charlotte
davils@uncc.edu

ABSTRACT

Various libraries have been released to support the development of recommender systems for some time, but it is only relatively recently that larger scale, open-source platforms have become readily available. In the context of such platforms, evaluation tools are important both to verify and validate baseline platform functionality, as well as to provide support for testing new techniques and approaches developed on top of the platform. We have adopted Apache Mahout as an enabling platform for our research and have faced both of these issues in employing it as part of our work in collaborative filtering. This paper presents a case study of evaluation focusing on accuracy and coverage evaluation metrics in Apache Mahout, a recent platform that provides support for recommender system application development. As part of this case study, we developed a new metric combining accuracy and coverage in order to evaluate functional changes made to Mahout's collaborative filtering algorithms.

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval–Information filtering

General Terms

Algorithms, Experimentation, Measurement

Keywords

Recommender systems, Evaluation, Mahout
1. INTRODUCTION

Selecting a foundational platform is an important step in developing recommender systems for personal, research, or commercial purposes. This can be done in many different ways: the platform may be developed from the ground up, an existing recommender engine may be contracted (e.g., OracleAS Personalization, http://download.oracle.com/docs/cd/B10464_05/bi.904/b12102/1intro.htm), code libraries can be adapted, or a platform may be selected and tailored to suit (e.g., LensKit, http://lenskit.grouplens.org/; MymediaLite, http://www.ismll.uni-hildesheim.de/mymedialite/; Apache Mahout, http://mahout.apache.org). In some cases, a combination of these approaches will be employed. For many projects, and particularly in the research context, the ideal situation is to find an open-source platform with many active contributors that provides a rich and varied set of recommender system functions and meets all or most of the baseline development requirements. Short of finding this ideal solution, minor customization of an existing system may be the best approach to meet the specific development requirements.

Various libraries have been released to support the development of recommender systems for some time, but it is only relatively recently that larger scale, open-source platforms have become readily available. In the context of such platforms, evaluation tools are important both to verify and validate baseline platform functionality and to support testing of new techniques and approaches developed on top of the platform. We have adopted Apache Mahout as an enabling platform for our research and have faced both of these issues in employing it as part of our work in collaborative filtering recommenders.

This paper presents a case study of evaluation for recommender systems in Apache Mahout, focusing on metrics for accuracy and coverage. We have developed functional changes to the baseline Mahout collaborative filtering algorithms to meet our research purposes, and this paper examines evaluation both from the standpoint of tools for baseline platform functionality and from the standpoint of enhancements and new functionality. The objective of this case study is to evaluate these functional changes made to the platform by comparing the baseline collaborative filtering algorithms to the changed algorithms using well-known measures of accuracy and coverage [6]. Our goal is not to validate algorithms that have already been tested previously, but to assess whether, and to what extent, the functional enhancements have improved the accuracy and coverage performance of the baseline out-of-the-box Mahout platform. Given the interplay between accuracy and coverage in this context, we developed a unified metric to assess accuracy vs. coverage trade-offs when evaluating functional changes made to Mahout's collaborative filtering algorithms.

2. RELATED WORK

Revisiting evaluation in the context of recommender platforms has received recent attention in the thorough evaluation of the LensKit platform using previously tested collaborative filtering algorithms and metrics, as reported in [2]. A comprehensive set of guidelines for evaluating recommender systems was provided by Herlocker et al. [6]; these guidelines highlight the use of evaluation metrics such as accuracy and coverage and suggest the need for an ideal "general coverage metric" that would combine coverage with accuracy to yield an overall "practical accuracy" measure. Many of these evaluation metrics and techniques have also been covered recently in [12].

Recommender system research has been primarily concerned with improving recommendation accuracy [7]; however, other metrics such as coverage [10, 4], as well as novelty and serendipity [6, 3], have been deemed necessary because accuracy alone is not sufficient to properly evaluate a system. McNee et al. [7] state that recommendations that are most accurate according to the standard metrics are sometimes not the most useful to users, and they outline a more user-centric approach to evaluation. The interplay between accuracy and other metrics such as coverage and serendipity creates trade-offs for recommender system implementers; this has been widely discussed in the literature, e.g., in [4, 3] and in our previous work on trade-offs between accuracy and robustness [11].

3. SELECTING APACHE MAHOUT

To support our research in collaborative filtering, several recommender system platforms were surveyed, including LensKit, easyrec (http://easyrec.org/), and MymediaLite. We selected Mahout because it provides many of the characteristics required for a recommender development workbench platform. Mahout is a production-level, open-source system that offers a wide range of capabilities useful to a recommender system developer: collaborative filtering algorithms, data clustering, and data classification. Mahout is also highly scalable and can support distributed processing of large data sets across clusters of computers using Hadoop (http://hadoop.apache.org/). Mahout recommenders support various similarity and neighborhood formation calculations; its recommendation prediction algorithms include user-based, item-based, Slope-One, and Singular Value Decomposition (SVD); and it incorporates Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) evaluation methods. Mahout is readily extensible and provides a wide range of Java classes for customization. As an open-source project, the Mahout developer/contributor community is very active; the Mahout wiki also provides a list of developers and a list of websites that have implemented Mahout (https://cwiki.apache.org/MAHOUT/mahout-wiki.html).
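To give a concrete sense of the components listed above, the following sketch wires together a basic user-based recommender and requests a prediction and a top-N list using the Mahout 0.x Taste API. The class names are those we recall from that API; the data file path, user/item IDs, and neighborhood size are illustrative assumptions rather than settings from this study.

    // Minimal sketch of a user-based recommender on the Mahout 0.x Taste API.
    // File path, IDs, and neighborhood size are illustrative only.
    import java.io.File;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class UserBasedExample {
      public static void main(String[] args) throws Exception {
        // Ratings file in Mahout's CSV format: userID,itemID,rating
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        // k-nearest-neighbor neighborhood; a similarity threshold could be used instead
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(50, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Predicted rating for a single user-item pair
        float prediction = recommender.estimatePreference(1L, 100L);
        System.out.println("Predicted rating: " + prediction);

        // Top-5 recommendations for the same user
        for (RecommendedItem item : recommender.recommend(1L, 5)) {
          System.out.println(item.getItemID() + " -> " + item.getValue());
        }
      }
    }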
3.1 Uncovering Mahout Details

Although Mahout is rich in documentation, there are implementation details of how Mahout works that could only be understood by reading the source code. Thus, for clarity in evaluation, we needed to verify the implementation of baseline platform functionality. The following describes some of these details for Mahout 0.4 'out-of-the-box':

Similarity Weighting: Mahout implements the classic Pearson Correlation as described in [8, 5]. Similarity weighting is supported in Mahout and consists of the following method:

    scaleFactor = 1.0 - count / (num + 1);
    if (result < 0.0)
        result = -1.0 + scaleFactor * (1.0 + result);
    else
        result = 1.0 - scaleFactor * (1.0 - result);

where count is the number of co-rated items between two users, num is the number of items in the dataset, and result is the calculated Pearson Correlation coefficient.

User-Based Prediction Algorithm: Mahout implements a Weighted Average prediction method similar to the approach described in [1], except that Mahout does not take the absolute value of the individual similarities in the denominator; it does, however, ensure that the predicted ratings are within the allowable range, e.g., between 1.0 and 5.0.

Item-Based Prediction Algorithm: Mahout implements a Weighted Average prediction method. This approach is similar to the algorithm in [9], except that Mahout does not take the absolute value of the individual similarities in the denominator; it does, however, ensure that the predicted ratings are within the allowable range, e.g., between 1.0 and 5.0. Also, Mahout does not provide support for neighborhood formation, e.g., similarity thresholding, for item-based prediction.

Accuracy Evaluation Calculation: Mahout executes the recommender system evaluator specified at run time (MAE or RMSE) and implements traditional techniques found in [6, 12]. For MAE, this would be

    MAE = \frac{\sum_{i=1}^{n} | ActualRating_i - PredictedRating_i |}{n}        (1)

where n is the total number of ratings predicted in the test run.
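To make the baseline behaviors above concrete, the following plain-Java sketch shows a weighted-average prediction that, like the baseline, sums raw similarities in the denominator and clamps the result to the rating range, together with an MAE computation per Equation (1). This is our reading of the behavior described above, not Mahout source code, and the method and parameter names are ours.

    // Plain-Java sketch of the baseline behaviors described in Section 3.1;
    // our reading of the described behavior, not Mahout source code.
    public class BaselineSketch {

      // Weighted-average prediction over a neighborhood: the denominator sums the
      // raw similarities (no absolute value), and the result is clamped to the
      // allowable rating range, e.g., [1.0, 5.0].
      static double weightedAveragePrediction(double[] similarities, double[] ratings,
                                              double minRating, double maxRating) {
        double numerator = 0.0;
        double denominator = 0.0;
        for (int v = 0; v < similarities.length; v++) {
          numerator += similarities[v] * ratings[v];
          denominator += similarities[v];   // no Math.abs() here
        }
        double prediction = numerator / denominator;
        return Math.max(minRating, Math.min(maxRating, prediction));
      }

      // Mean Absolute Error per Equation (1).
      static double mae(double[] actualRatings, double[] predictedRatings) {
        double sumOfAbsoluteErrors = 0.0;
        for (int i = 0; i < actualRatings.length; i++) {
          sumOfAbsoluteErrors += Math.abs(actualRatings[i] - predictedRatings[i]);
        }
        return sumOfAbsoluteErrors / actualRatings.length;
      }
    }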
3.2 Making Mahout Fit for Purpose

Through personal email communication with one of the Mahout developers, we were informed that Mahout intended to provide basic rating prediction and similarity weighting capabilities for its recommenders and that it would be up to developers to provide more elaborate approaches. Several changes were made to the prediction algorithms and the similarity weighting techniques for both the user-based and item-based recommenders in order to meet our specific requirements and to match the best practices found in the literature, as follows:

Similarity weighting: Defined as Significance Weighting in [5], this consists of the following method:

    scaleFactor = count / 50.0;
    if (scaleFactor > 1.0)
        scaleFactor = 1.0;
    result = scaleFactor * result;

where count is the number of co-rated items between two users, and result is the calculated Pearson Correlation coefficient.

User-user mean-centered prediction: After identifying a neighborhood of similar users, a prediction, as documented in [8, 5, 1], is computed for a target item i and target user u as follows:

    p_{u,i} = \bar{r}_u + \frac{\sum_{v \in V} sim_{u,v} (r_{v,i} - \bar{r}_v)}{\sum_{v \in V} | sim_{u,v} |}        (2)

where V is the set of k similar users who have rated item i, r_{v,i} is the rating of those users for item i, \bar{r}_u is the average rating of the target user u over all rated items, \bar{r}_v is the average rating of user v over all co-rated items, and sim_{u,v} is the Pearson correlation coefficient.

Item-item mean-centered prediction: A prediction, as documented in [1], is computed for a target item i and target user u as follows:

    p_{u,i} = \bar{r}_i + \frac{\sum_{j \in N_u(i)} sim_{i,j} (r_{u,j} - \bar{r}_j)}{\sum_{j \in N_u(i)} | sim_{i,j} |}        (3)

where N_u(i) is the set of items rated by user u that are most similar to item i, r_{u,j} is u's rating of item j, \bar{r}_j is the average rating for item j over all users who rated item j, \bar{r}_i is the average rating for the target item i, and sim_{i,j} is the Pearson correlation coefficient.

Item-item similarity thresholding: This method was added to Mahout and used in conjunction with the item-item mean-centered prediction described above. Similarity thresholding, as described in [5], defines a level of similarity that is required for two items to be considered similar for purposes of making a recommendation prediction; item-item similarities below the threshold are not used in the prediction calculation.
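To illustrate how these modifications fit together, the following standalone sketch combines significance weighting, similarity thresholding, and user-user mean-centered prediction per Equation (2). It is a simplified illustration of the approach described above, not the actual code we added to Mahout; the method names and array-based inputs are ours.

    // Standalone sketch of the modified user-user pipeline in Section 3.2:
    // significance weighting, similarity thresholding, and mean-centered
    // prediction per Equation (2). Not the actual code added to Mahout.
    public class ModifiedUserUserSketch {

      // Significance weighting [5]: scale the Pearson correlation down when the
      // number of co-rated items is below 50.
      static double significanceWeighted(double pearson, int coRatedCount) {
        double scaleFactor = Math.min(coRatedCount / 50.0, 1.0);
        return scaleFactor * pearson;
      }

      // Mean-centered prediction for target user u and item i over a neighborhood.
      // Each neighbor v contributes its (weighted) similarity sim(u,v), its rating
      // r_{v,i}, and its mean rating; neighbors whose similarity falls below the
      // threshold are excluded, mirroring similarity thresholding.
      static Double meanCenteredPrediction(double targetUserMean,
                                           double[] similarities,     // sim(u,v), already weighted
                                           double[] neighborRatings,  // r_{v,i}
                                           double[] neighborMeans,    // mean rating of each v
                                           double similarityThreshold) {
        double numerator = 0.0;
        double denominator = 0.0;
        for (int v = 0; v < similarities.length; v++) {
          if (similarities[v] < similarityThreshold) {
            continue;                                // thresholding: skip weak neighbors
          }
          numerator += similarities[v] * (neighborRatings[v] - neighborMeans[v]);
          denominator += Math.abs(similarities[v]);  // absolute value per Equation (2)
        }
        if (denominator == 0.0) {
          return null;  // no usable neighbors: no prediction is made, which lowers coverage
        }
        return targetUserMean + numerator / denominator;
      }
    }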
Coverage and combined accuracy/coverage metric: As suggested in [6], the easiest way to measure coverage is to select a random sample of user-item pairs, ask for a prediction for each pair, and measure the percentage for which a prediction was provided. To calculate coverage, code changes were made to Mahout to provide, for each test run, the total number of rating predictions requested that could not be calculated as well as the total number of rating predictions requested that were actually calculated; the sum of these two numbers is the total number of ratings requested. Coverage was calculated as follows:

    Coverage = \frac{Total\#RatingsCalculated}{Total\#RatingsRequested}        (4)

Code changes were also made to calculate a combined accuracy and coverage metric, as defined in Section 4.

4. ACCURACY AND COVERAGE METRIC

The metrics selected for this case study, accuracy and coverage, were chosen because they are fundamental to the utility of a recommender system [10, 6]. Although other metrics such as novelty and serendipity can, and should, be used in conjunction with accuracy and coverage, our objective was to evaluate the very basic requirements of a recommender system. Our implementation of coverage, referred to as prediction coverage in [6], measures the percentage of a dataset for which the recommender system is able to provide predictions. High coverage indicates that the recommender system is able to provide predictions for a large number of items and is considered a desirable characteristic of a recommender system [6]. A combination of high accuracy (low error rate) and high coverage is desirable to users and system operators alike because it improves the utility, or usefulness, of the system from a user standpoint [10, 6].

What constitutes 'good' accuracy or coverage, however, has not been well defined in the literature: studies such as [10, 4, 5], and many others, endeavor to maximize accuracy (achieve the lowest possible error value) and/or coverage (achieve the highest possible value) and view these metrics on a relative basis, i.e., how much the metric has increased or decreased beyond a baseline value based on empirical results. Furthermore, the interplay between accuracy and coverage, i.e., coverage decreases as a function of accuracy [4, 3], creates a trade-off for recommender system implementers that has been discussed previously but not developed thoroughly. Inspired by the suggestion in [6] to combine the coverage and accuracy measures to yield an overall "practical accuracy" measure for the recommender system, we developed a straightforward "AC Measure" that combines both accuracy and coverage into a single metric as follows:

    AC_i = \frac{Accuracy_i}{Coverage_i}        (5)

where i indicates the i-th trial in an evaluation experiment.

[Figure 1: Illustration of the AC Measure]

The AC Measure simply adjusts the accuracy upward according to the level of coverage found in an experimental trial and is agnostic to the accuracy metric used, e.g., MAE or RMSE. Using a family of curves for the Mean Absolute Error (MAE) accuracy metric, Figure 1 illustrates the relationship between accuracy, coverage, and the AC Measure. As an example, following the "MAE: 0.5" curve, we see that at 100% coverage the AC Measure is 0.5, while at 10% coverage the AC Measure has increased to 5. The intuition behind this metric is that when the recommender system is able to provide predictions for a high percentage of items in the dataset, the accuracy metric closely indicates the level of system performance; conversely, when coverage is low, the accuracy metric is "penalized" and adjusted upward. We believe that the major benefit of the AC Measure is that it formulates a solution for addressing the trade-off between accuracy and coverage and can be used to create a ranked list of results (low to high) from multiple experimental trials to find the best (lowest) AC Measure for each set of test conditions. The simplified visualization of the combined AC Measure shown in Figure 1 is an additional benefit. For our evaluation purposes, the use of a combined metric was ideal in addressing the inherent trade-offs between accuracy and coverage, especially in cases where accuracy is found to be high while coverage is low; we posit that the AC Measure will also be useful to other researchers performing evaluations using accuracy and coverage.
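The computations behind Equations (4) and (5), and the low-to-high ranking of trials, can be summarized in a few lines. The sketch below uses an illustrative Trial class (not a structure from our code) and reproduces the Figure 1 example of an MAE of 0.5 at 100% and at 10% coverage.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    // Sketch of the coverage and AC Measure computations (Equations 4 and 5) and
    // the low-to-high ranking of trials; the Trial class is illustrative only.
    public class AcMeasureSketch {

      static class Trial {
        final String description;
        final double mae;            // accuracy for this trial (MAE here; RMSE also works)
        final int ratingsRequested;  // predictions calculated plus those that could not be
        final int ratingsCalculated;

        Trial(String description, double mae, int ratingsRequested, int ratingsCalculated) {
          this.description = description;
          this.mae = mae;
          this.ratingsRequested = ratingsRequested;
          this.ratingsCalculated = ratingsCalculated;
        }

        double coverage() {          // Equation (4)
          return (double) ratingsCalculated / ratingsRequested;
        }

        double acMeasure() {         // Equation (5): accuracy adjusted upward by low coverage
          return mae / coverage();
        }
      }

      public static void main(String[] args) {
        List<Trial> trials = new ArrayList<>();
        trials.add(new Trial("MAE 0.50 at 100% coverage", 0.50, 30000, 30000)); // AC = 0.5
        trials.add(new Trial("MAE 0.50 at 10% coverage", 0.50, 30000, 3000));   // AC = 5.0

        // Rank trials from best (lowest AC Measure) to worst.
        trials.sort(Comparator.comparingDouble(Trial::acMeasure));
        for (Trial t : trials) {
          System.out.printf("%s -> AC Measure %.2f%n", t.description, t.acMeasure());
        }
      }
    }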
5. EXPERIMENTAL DESIGN

The objective of this case study was to understand Mahout's baseline collaborative filtering algorithms and to evaluate functional changes made to the platform using accuracy and coverage metrics. The main intent of making functional changes to the Mahout recommender algorithms was to bring them in line with best practices found in the literature. Therefore, the overall hypothesis to be tested in this case study was that the modified algorithms improve Mahout's 'out-of-the-box' prediction accuracy for both user-based and item-based recommenders while maintaining reasonable coverage.

5.1 Datasets and Algorithms

The data used in this study were the MovieLens datasets downloaded from GroupLens Research (http://www.grouplens.org): the 100K dataset with 100,000 ratings for 1,682 movies and 943 users (referred to as ML100K in this study) and the 10M dataset with 10,000,000 ratings for 10,681 movies and 69,878 users (referred to as ML10M in this study). Ratings in these datasets are integer values from 1 (did not like) to 5 (liked very much).

For the user-based recommender (see §3.1), Mahout uses Pearson Correlation similarity (with and without similarity weighting), neighborhood formation (similarity thresholding or kNN), and Weighted Average prediction. This was tested against a modified algorithm (see §3.2) consisting of Pearson Correlation similarity (with and without similarity weighting), neighborhood formation (similarity thresholding or kNN), and mean-centered prediction. For the item-based recommender (see §3.1), Mahout uses Pearson Correlation similarity (with and without similarity weighting), no neighborhood formation, and Weighted Average prediction. This was tested against a modified algorithm (see §3.2) consisting of Pearson Correlation similarity (with and without similarity weighting), neighborhood formation (similarity thresholding), and mean-centered prediction.

5.1.1 Test Cases

In order to test the overall hypothesis, the following test cases were developed and executed for both user-based and item-based recommenders using the ML100K and ML10M datasets:

1. Mahout Prediction, No weighting
2. Mahout Prediction, Mahout weighted
3. Mahout Prediction, Significance weighted
4. Mean-Centered Prediction, No weighting
5. Mean-Centered Prediction, Mahout weighted
6. Mean-Centered Prediction, Significance weighted

5.1.2 Accuracy and Coverage Metrics

We used Mahout's MAE evaluator to measure the accuracy of the rating predictions. For prediction coverage, we used the dataset training data to estimate the rating predictions for the test set; the random sample of user-item pairs in our testing was 30K pairs for ML100K and 25K pairs for ML10M (see §3.2). AC Measures were calculated for all test cases.

5.1.3 Dataset Partitioning

The Mahout evaluator creates holdout partitions according to a set of run-time parameters. (Holdout is a method that splits a dataset into a training set and a test set; the partitioning is performed by randomly selecting some ratings from all, or some, of the users. The selected ratings constitute the test set, while the remaining ones form the training set.) For the tests using the ML100K dataset, the training set was 70% of the data, the test set was 30% of the data, and 100% of the user data was used; a total of 30K rating predictions from 943 users were requested for each test set. For the tests using the ML10M dataset, the training set was 95% of the data, the test set was 5% of the data, and 5% of the user data was used; a total of 25K rating predictions from 3,180 users were requested for each test set.

5.1.4 Test Variations

Various similarity thresholds and kNN neighborhood sizes were executed for each test case in order to understand and evaluate the corresponding behavior of the recommenders. For user-based recommender testing, similarity thresholds of 0.0, 0.1, 0.3, 0.5, and 0.7 and kNN neighborhood sizes of 600, 400, 200, 100, 50, 20, 10, 5, and 2 were tested. For item-based recommender testing, in addition to using no similarity thresholding, similarity thresholds of 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, and 0.7 were tested.
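For reference, the sketch below shows how holdout splits like those in §5.1.3 are typically passed to Mahout's MAE evaluator in the 0.x Taste API, with the training and evaluation fractions mirroring the ML100K (70/30, all users) and ML10M (95/5, 5% of users) configurations described above. This is our assumption of a typical setup, not the exact harness used in this study, and the RecommenderBuilder is supplied by the caller.

    // Sketch of configuring Mahout's MAE evaluator for the holdout splits
    // described in Section 5.1.3 (Mahout 0.x Taste API); an assumed setup,
    // not the exact harness used in this study.
    import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
    import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
    import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
    import org.apache.mahout.cf.taste.model.DataModel;

    public class EvaluationSketch {

      static double runMae(RecommenderBuilder builder, DataModel model,
                           double trainingFraction, double userFraction) throws Exception {
        // AverageAbsoluteDifferenceRecommenderEvaluator computes MAE; an RMSE
        // evaluator can be substituted when RMSE is specified at run time.
        RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
        // trainingFraction: share of each selected user's ratings used for training;
        // userFraction: share of users included in the evaluation.
        return evaluator.evaluate(builder, null, model, trainingFraction, userFraction);
      }

      static void runBothSplits(RecommenderBuilder builder, DataModel ml100k, DataModel ml10m)
          throws Exception {
        double maeMl100k = runMae(builder, ml100k, 0.70, 1.0);  // 70/30 split, 100% of users
        double maeMl10m = runMae(builder, ml10m, 0.95, 0.05);   // 95/5 split, 5% of users
        System.out.println("ML100K MAE: " + maeMl100k + ", ML10M MAE: " + maeMl10m);
      }
    }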
6. RESULTS AND DISCUSSION

6.1 ML10M Results

Figures 2 and 3 show the results of test cases 1 through 6 for the user-based and item-based algorithms, respectively. (Several curves are superimposed over each other because the values are very similar: the MAE results for mean-centered prediction with no weighting and with Mahout weighting, the MAE results for Mahout prediction with no weighting and with Mahout weighting, the coverage results for Mahout prediction and mean-centered prediction with no weighting and with Mahout weighting, and the coverage results for Mahout prediction and mean-centered prediction, both significance weighted.)

[Figure 2: User-based Mahout Recommender Results for ML10M, Test cases 1 through 6]

[Figure 3: Item-based Mahout Recommender Results for ML10M, Test cases 1 through 6]

The key results of the experiment, for both user-based and item-based algorithms unless otherwise noted, were as follows:

1. MAE for mean-centered prediction with significance weighting is a significant improvement (p<0.01) over MAE for Mahout prediction, regardless of weighting, across similarity thresholds (except item-based at a similarity threshold of 0.7) and kNN neighborhood sizes (except user-based at a kNN of 2, not shown).

2. Mahout similarity weighting does not significantly improve (p<0.01) Mahout prediction MAE over prediction with no similarity weighting (except Mahout prediction for user-based and item-based at a similarity threshold of 0.4, not shown). This indicates that Mahout similarity weighting is not very effective as a weighting technique, especially as compared to significance weighting.

6.2 ML100K Results

The results and trend lines for the ML100K experiment are similar to those for ML10M. The key results, for both user-based and item-based algorithms unless otherwise noted, were:

1. MAE for mean-centered prediction with significance weighting is a significant improvement (p<0.01) over MAE for Mahout prediction, regardless of weighting, across similarity thresholds and kNN neighborhood sizes (except user-based at a kNN of 400).

2. Mahout similarity weighting does not significantly improve (p<0.01) Mahout prediction MAE over prediction with no similarity weighting (except Mahout prediction for user-based and item-based at a similarity threshold of 0.4).
6.3 Discussion

As hypothesized, the results for both the ML100K and ML10M experiments show significant improvements in MAE using the mean-centered prediction algorithm with significance weighting compared to the Mahout baseline prediction algorithm. However, when coverage is considered, the "best" MAE results may need a second look. Can an MAE of 0.5 or less be considered "good" when the associated coverage is in the single digits? In that case, the recommender system may only be able to provide recommendations to a very small subset of its users, a situation that system operators must avoid. To help address the accuracy vs. coverage trade-off, combined measures such as the AC Measure (Section 4) can help by considering both accuracy and coverage simultaneously. For the ML10M experiment, we determined that the lowest MAE for the user-based algorithm using mean-centered prediction with significance weighting was 0.578, at a similarity threshold of 0.7 and coverage of 0.833%; the AC Measure for this result is calculated as 69.42. Similarly, the lowest MAE for the item-based algorithm using mean-centered prediction with significance weighting was 0.371, at a similarity threshold of 0.7 and coverage of 1.02%; the AC Measure for this result is calculated as 36.32. In each of these cases, the exceedingly high values of the AC Measure indicate that these results are not very desirable in a recommender system.

Figures 4 and 5 show the AC Measure results for the user-based and item-based algorithms using ML10M, respectively. Rather than show all 30 results for each algorithm (5 similarity thresholds x 2 prediction methods x 3 weighting types), we show only the results with calculated AC Measure values less than 1.0; the lowest MAE results reported above for the user-based and item-based algorithms are therefore clearly beyond the range of these charts. We found that the best combined accuracy/coverage results occurred at higher levels of coverage and lower similarity thresholds, i.e., the best (lowest) AC Measure for user-based was 0.688 at a similarity threshold of 0.1 and for item-based was 0.665 at a similarity threshold of 0.0, both using mean-centered prediction and significance weighting. We can also see that, with few exceptions, mean-centered prediction improves over Mahout prediction for the same similarity weighting and similarity threshold. We observed similar results using the ML100K dataset, where the best (lowest) AC Measure for user-based was 0.765 and for item-based was 0.746, both at a similarity threshold of 0.0 and both using mean-centered prediction and significance weighting. These results demonstrate that the "best" MAE may not always be the lowest MAE, especially when coverage is also considered; furthermore, recommender system settings such as similarity weighting and neighborhood size also need to be considered during system evaluation.

[Figure 4: AC Measure for selected User-based results (lower is better)]

[Figure 5: AC Measure for selected Item-based results (lower is better)]

Other observations from our experiments that match results reported in [5], and that serve to validate our evaluation and increase our confidence in the results, are: (a) in general, significance weighting improves prediction MAE compared to predictions using Mahout similarity weighting or no similarity weighting; (b) as the similarity threshold increases, MAE for mean-centered prediction with significance weighting improves and coverage degrades, whereas MAE and coverage both degrade for Mahout prediction with Mahout weighting; and (c) coverage decreases as neighborhood size decreases.
7. CONCLUSION

Our case study of Mahout as a recommender system platform highlights evaluation considerations for developers and also shows how straightforward functional enhancements improve the performance of the baseline platform. We evaluated our changes against current Mahout functionality using accuracy and coverage metrics, not only to assess baseline results, but also to provide a view of the trade-offs between accuracy and coverage that result from using different recommender algorithms. We reported cases where the lowest MAE accuracy results were not necessarily the 'best' when coverage results were also considered, and we instrumented Mahout with a combined accuracy and coverage metric (the AC Measure) to evaluate these trade-offs more directly. We believe that this case study will provide useful guidance in using Mahout as a recommender platform, and that our combined measure will prove useful in evaluating algorithm changes for the inherent trade-offs between accuracy and coverage.

8. REFERENCES

[1] C. Desrosiers and G. Karypis. A comprehensive survey of neighborhood-based recommendation methods. In F. Ricci, L. Rokach, B. Shapira, and P. B. Kantor, editors, Recommender Systems Handbook. Springer, 2011.
[2] M. D. Ekstrand, M. Ludwig, J. A. Konstan, and J. T. Riedl. Rethinking the recommender research ecosystem: Reproducibility, openness, and LensKit. In Proceedings of the 5th ACM Recommender Systems Conference (RecSys '11), October 2011.
[3] M. Ge, C. Delgado-Battenfeld, and D. Jannach. Beyond accuracy: Evaluating recommender systems by coverage and serendipity. In Proceedings of the 4th ACM Recommender Systems Conference (RecSys '10), September 2010.
[4] N. Good, J. B. Schafer, J. A. Konstan, A. Borchers, B. Sarwar, J. Herlocker, and J. Riedl. Combining collaborative filtering with personal agents for better recommendations. In Proceedings of the 16th National Conference on Artificial Intelligence (AAAI-99), July 1999.
[5] J. L. Herlocker, J. A. Konstan, A. Borchers, and J. Riedl. An algorithmic framework for performing collaborative filtering. In Proceedings of the ACM SIGIR Conference, 1999.
[6] J. L. Herlocker, J. A. Konstan, L. G. Terveen, and J. Riedl. Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems, 22(1):5–53, 2004.
[7] S. McNee, J. Riedl, and J. Konstan. Accurate is not always good: How accuracy metrics have hurt recommender systems. In Proceedings of the Conference on Human Factors in Computing Systems (CHI 2006), April 2006.
[8] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl. GroupLens: An open architecture for collaborative filtering of netnews. In Proceedings of the ACM CSCW Conference, 1994.
[9] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In Proceedings of the World Wide Web Conference, 2001.
[10] B. M. Sarwar, J. A. Konstan, A. Borchers, J. Herlocker, B. Miller, and J. Riedl. Using filtering agents to improve prediction quality in the GroupLens research collaborative filtering system. In Proceedings of the ACM 1998 Conference on Computer Supported Cooperative Work (CSCW '98), November 1998.
[11] C. E. Seminario and D. C. Wilson. Robustness and accuracy tradeoffs for recommender systems under attack. In Proceedings of the 25th Florida Artificial Intelligence Research Society Conference (FLAIRS-25), May 2012.
[12] G. Shani and A. Gunawardana. Evaluating recommendation systems. In F. Ricci, L. Rokach, B. Shapira, and P. B. Kantor, editors, Recommender Systems Handbook. Springer, 2011.