Case Study Evaluation of Mahout as a Recommender Platform

Carlos E. Seminario
Software and Information Systems Dept.
University of North Carolina Charlotte
cseminar@uncc.edu

David C. Wilson
Software and Information Systems Dept.
University of North Carolina Charlotte
davils@uncc.edu

ABSTRACT
Various libraries have been released to support the development of recommender systems for some time, but it is only relatively recently that larger scale, open-source platforms have become readily available. In the context of such platforms, evaluation tools are important both to verify and validate baseline platform functionality, as well as to provide support for testing new techniques and approaches developed on top of the platform. We have adopted Apache Mahout as an enabling platform for our research and have faced both of these issues in employing it as part of our work in collaborative filtering. This paper presents a case study of evaluation focusing on accuracy and coverage evaluation metrics in Apache Mahout, a recent platform tool that provides support for recommender system application development. As part of this case study, we developed a new metric combining accuracy and coverage in order to evaluate functional changes made to Mahout's collaborative filtering algorithms.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval–Information filtering

General Terms
Algorithms, Experimentation, Measurement

Keywords
Recommender systems, Evaluation, Mahout

1. INTRODUCTION
Selecting a foundational platform is an important step in developing recommender systems for personal, research, or commercial purposes. This can be done in many different ways: the platform may be developed from the ground up, an existing recommender engine may be contracted (e.g., OracleAS Personalization¹), code libraries can be adapted, or a platform may be selected and tailored to suit (e.g., LensKit², MymediaLite³, Apache Mahout⁴, etc.). In some cases, a combination of these approaches will be employed.

For many projects, and particularly in the research context, the ideal situation is to find an open-source platform with many active contributors that provides a rich and varied set of recommender system functions that meets all or most of the baseline development requirements. Short of finding this ideal solution, some minor customization to an already existing system may be the best approach to meet the specific development requirements. Various libraries have been released to support the development of recommender systems for some time, but it is only relatively recently that larger scale, open-source platforms have become readily available. In the context of such platforms, evaluation tools are important both to verify and validate baseline platform functionality, as well as to provide support for testing new techniques and approaches developed on top of the platform. We have adopted Apache Mahout as an enabling platform for our research and have faced both of these issues in employing it as part of our work in collaborative filtering recommenders.

This paper presents a case study of evaluation for recommender systems in Apache Mahout, focusing on metrics for accuracy and coverage. We have developed functional changes to the baseline Mahout collaborative filtering algorithms to meet our research purposes, and this paper examines evaluation both from the standpoint of tools for baseline platform functionality, as well as for enhancements and new functionality. The objective of this case study is to evaluate these functional changes made to the platform by comparing the baseline collaborative filtering algorithms to the changed algorithms using well known measures of accuracy and coverage [6]. Our goal is not to validate algorithms that have already been tested previously, but to assess whether, and to what extent, the functional enhancements have improved the accuracy and coverage performance of the baseline out-of-the-box Mahout platform. Given the interplay between accuracy and coverage in this context, we developed a unified metric to assess accuracy vs. coverage trade-offs when evaluating functional changes made to Mahout's collaborative filtering algorithms.

¹ http://download.oracle.com/docs/cd/B10464 05/bi.904/b12102/1intro.htm
² http://lenskit.grouplens.org/
³ http://www.ismll.uni-hildesheim.de/mymedialite/
⁴ http://mahout.apache.org

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Copyright is held by the author/owner(s). Workshop on Recommendation Utility Evaluation: Beyond RMSE (RUE 2012), held in conjunction with ACM RecSys 2012. September 9, 2012, Dublin, Ireland.
2. RELATED WORK
Revisiting evaluation in the context of recommender platforms has received recent attention in the thorough evaluation of the LensKit platform using previously tested collaborative filtering algorithms and metrics, as reported in [2]. A comprehensive set of guidelines for evaluating recommender systems was provided by Herlocker et al. [6]; these guidelines highlight the use of evaluation metrics such as accuracy and coverage and suggest the need for an ideal "general coverage metric" that would combine coverage with accuracy to yield an overall "practical accuracy" measure. Many of these evaluation metrics and techniques have also been covered recently in [12].

Recommender system research has been primarily concerned with improving recommendation accuracy [7]; however, other metrics such as coverage [10, 4] and also novelty and serendipity [6, 3] have been deemed necessary because accuracy alone is not sufficient to properly evaluate the system. McNee et al. [7] state that recommendations that are most accurate according to the standard metrics are sometimes not the most useful to users, and they outline a more user-centric approach to evaluation. The interplay between accuracy and other metrics such as coverage and serendipity creates trade-offs for recommender system implementers; this has been widely discussed in the literature, e.g., see [4, 3] and our previous work discussing trade-offs between accuracy and robustness [11].

3. SELECTING APACHE MAHOUT
To support our research in collaborative filtering, several recommender system platforms were surveyed, including LensKit, easyrec⁵, and MymediaLite. We selected Mahout because it provides many of the desired characteristics required for a recommender development workbench platform. Mahout is a production-level, open-source system and consists of a wide range of applications that are useful for a recommender system developer: collaborative filtering algorithms, data clustering, and data classification. Mahout is also highly scalable and is able to support distributed processing of large data sets across clusters of computers using Hadoop⁶. Mahout recommenders support various similarity and neighborhood formation calculations; recommendation prediction algorithms include user-based, item-based, Slope-One, and Singular Value Decomposition (SVD); and Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) evaluation methods are also included. Mahout is readily extensible and provides a wide range of Java classes for customization. As an open-source project, the Mahout developer/contributor community is very active; the Mahout wiki also provides a list of developers and a list of websites that have implemented Mahout⁷.

⁵ http://easyrec.org/
⁶ http://hadoop.apache.org/
⁷ https://cwiki.apache.org/MAHOUT/mahout-wiki.html

3.1 Uncovering Mahout Details
Although Mahout is rich in documentation, there are implementation details on how Mahout works that could only be understood by looking at the source code. Thus, for clarity in evaluation, we needed to verify the implementation of baseline platform functionality. The following describes some of these details for Mahout 0.4 'out-of-the-box':

Similarity Weighting: Mahout implements the classic Pearson Correlation as described in [8, 5]. Similarity weighting is supported in Mahout and consists of the following method:

    scaleFactor = 1.0 - count / (num + 1);
    if (result < 0.0)
      result = -1.0 + scaleFactor * (1.0 + result);
    else
      result = 1.0 - scaleFactor * (1.0 - result);

where count is the number of co-rated items between two users, num is the number of items in the dataset, and result is the calculated Pearson Correlation coefficient.

User-Based Prediction Algorithm: Mahout implements a Weighted Average prediction method similar to the approach described in [1], except that Mahout does not take the absolute value of the individual similarities in the denominator; however, it does ensure that the predicted ratings are within the allowable range, e.g., between 1.0 and 5.0.

Item-Based Prediction Algorithm: Mahout implements a Weighted Average prediction method. This approach is similar to the algorithm in [9], except that Mahout does not take the absolute value of the individual similarities in the denominator; however, it does ensure that the predicted ratings are within the allowable range, e.g., between 1.0 and 5.0. Also, Mahout does not provide support for neighborhood formation, e.g., similarity thresholding, for item-based prediction.

Accuracy Evaluation calculation: Mahout executes the recommender system evaluator specified at run time (MAE or RMSE) and implements traditional techniques found in [6, 12]. For MAE, this would be

    MAE = \frac{\sum_{i=1}^{n} | ActualRating_i - PredictedRating_i |}{n}    (1)

where n is the total number of ratings predicted in the test run.
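For reference, the following is a minimal sketch of how a baseline Mahout 0.x user-based recommender is typically wired into the MAE evaluator described above, assuming the standard org.apache.mahout.cf.taste classes; it is illustrative rather than a verbatim excerpt of our test harness, and the file path and neighborhood size are placeholders.

    import java.io.File;
    import org.apache.mahout.cf.taste.common.TasteException;
    import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
    import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
    import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class MaeEvaluationSketch {
      public static void main(String[] args) throws Exception {
        // "ratings.csv" is a placeholder path to a MovieLens-style "user,item,rating" file
        DataModel model = new FileDataModel(new File("ratings.csv"));

        RecommenderBuilder builder = new RecommenderBuilder() {
          public Recommender buildRecommender(DataModel dataModel) throws TasteException {
            UserSimilarity similarity = new PearsonCorrelationSimilarity(dataModel);
            // kNN neighborhood of size 50 (placeholder value)
            UserNeighborhood neighborhood = new NearestNUserNeighborhood(50, similarity, dataModel);
            return new GenericUserBasedRecommender(dataModel, neighborhood, similarity);
          }
        };

        // MAE evaluator (Equation 1); 70% training split over 100% of users, as in the ML100K runs
        RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
        double mae = evaluator.evaluate(builder, null, model, 0.7, 1.0);
        System.out.println("MAE = " + mae);
      }
    }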
3.2 Making Mahout Fit for Purpose
Through personal email communication with one of the Mahout developers, we were informed that Mahout intended to provide basic rating prediction and similarity weighting capabilities for its recommenders and that it would be up to developers to provide more elaborate approaches. Several changes were made to the prediction algorithms and the similarity weighting techniques for both the user-based and item-based recommenders in order to meet our specific requirements and to match the best practices found in the literature, as follows:

Similarity weighting: Defined as Significance Weighting in [5], this consists of the following method:

    scaleFactor = count / 50.0;
    if (scaleFactor > 1.0) scaleFactor = 1.0;
    result = scaleFactor * result;

where count is the number of co-rated items between two users, and result is the calculated Pearson Correlation coefficient.

User-user mean-centered prediction: After identifying a neighborhood of similar users, a prediction, as documented in [8, 5, 1], is computed for a target item i and target user u as follows:

    p_{u,i} = \bar{r}_u + \frac{\sum_{v \in V} sim_{u,v} (r_{v,i} - \bar{r}_v)}{\sum_{v \in V} | sim_{u,v} |}    (2)
where V is the set of k similar users who have rated item i, r_{v,i} is the rating of user v for item i, \bar{r}_u is the average rating for the target user u over all rated items, \bar{r}_v is the average rating for user v over all co-rated items, and sim_{u,v} is the Pearson correlation coefficient.

Item-item mean-centered prediction: A prediction, as documented in [1], is computed for a target item i and target user u as follows:

    p_{u,i} = \bar{r}_i + \frac{\sum_{j \in N_u(i)} sim_{i,j} (r_{u,j} - \bar{r}_j)}{\sum_{j \in N_u(i)} | sim_{i,j} |}    (3)

where N_u(i) is the set of items rated by user u that are most similar to item i, r_{u,j} is u's rating of item j, \bar{r}_j is the average rating for item j over all users who rated item j, \bar{r}_i is the average rating for target item i, and sim_{i,j} is the Pearson correlation coefficient.

Item-item similarity thresholding: This method was added to Mahout and used in conjunction with the item-item mean-centered prediction described above. Similarity thresholding, as described in [5], defines a level of similarity that is required for two items to be considered similar for purposes of making a recommendation prediction; item-item similarities that are less than the threshold are not used in the prediction calculation.
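To make the modified prediction and thresholding logic concrete, the following is a minimal, self-contained sketch of the mean-centered weighted average of Equations (2) and (3) with a similarity threshold applied. It is plain Java written for this paper rather than Mahout code, and all names are ours.

    /**
     * Sketch of mean-centered prediction (Equations 2 and 3) with similarity thresholding.
     * For user-based CF, targetMean is the target user's mean rating, neighborRatings holds
     * each neighbor's rating of the target item, neighborMeans each neighbor's mean rating,
     * and sims the user-user Pearson similarities; for item-based CF the arguments are the
     * item-side analogues. Neighbors whose similarity falls below simThreshold are ignored.
     */
    public final class MeanCenteredPrediction {

      public static double predict(double targetMean,
                                   double[] neighborRatings,
                                   double[] neighborMeans,
                                   double[] sims,
                                   double simThreshold) {
        double numerator = 0.0;
        double denominator = 0.0;
        for (int v = 0; v < sims.length; v++) {
          if (sims[v] < simThreshold) {
            continue; // similarity thresholding: drop weak neighbors
          }
          numerator += sims[v] * (neighborRatings[v] - neighborMeans[v]);
          denominator += Math.abs(sims[v]); // absolute value in the denominator, per [1, 5, 8]
        }
        if (denominator == 0.0) {
          return Double.NaN; // no usable neighbors: no prediction (this lowers coverage)
        }
        double prediction = targetMean + numerator / denominator;
        // clamp to the allowable rating range, as Mahout does
        return Math.max(1.0, Math.min(5.0, prediction));
      }
    }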
Coverage and combined accuracy/coverage metric: As suggested in [6], the easiest way to measure coverage is to select a random sample of user-item pairs, ask for a prediction for each pair, and measure the percentage for which a prediction was provided. To calculate coverage, code changes were made to Mahout to provide, for each test run, the total number of rating predictions requested that could not be calculated as well as the total number of rating predictions requested that were actually calculated; the sum of these two numbers is the total number of ratings requested. Coverage was calculated as follows:

    Coverage = \frac{Total\,\#\,Ratings\,Calculated}{Total\,\#\,Ratings\,Requested}    (4)

Code changes were also made to calculate a combined accuracy and coverage metric as defined in Section 4.
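A minimal sketch of the coverage bookkeeping described above is shown below; it assumes that a failed prediction is signaled by NaN (Mahout can also signal this by other means), and the class and method names are ours, not Mahout's.

    /** Tracks prediction coverage (Equation 4) during an evaluation run. */
    public final class CoverageTracker {
      private long requested = 0;   // total rating predictions requested
      private long calculated = 0;  // predictions the recommender could actually produce

      /** Record one requested prediction; NaN signals that no estimate could be computed. */
      public void record(double predictedRating) {
        requested++;
        if (!Double.isNaN(predictedRating)) {
          calculated++;
        }
      }

      /** Coverage = #ratings calculated / #ratings requested (Equation 4). */
      public double coverage() {
        return requested == 0 ? 0.0 : (double) calculated / requested;
      }
    }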
4. ACCURACY AND COVERAGE METRIC
The metrics selected for this case study, accuracy and coverage, were chosen because they are fundamental to the utility of a recommender system [10, 6]. Although other metrics such as novelty and serendipity can, and should, be used in conjunction with accuracy and coverage, our objective was to evaluate the very basic requirements of a recommender system. Our implementation of coverage, referred to as prediction coverage in [6], measures the percentage of a dataset for which the recommender system is able to provide predictions. High coverage would indicate that the recommender system is able to provide predictions for a large number of items and is considered to be a desirable characteristic of the recommender system [6]. A combination of high accuracy (low error rate) and high coverage is indeed desirable for users and system operators because it improves the utility or usefulness of the system from a user standpoint [10, 6].

What constitutes 'good' accuracy or coverage, however, has not been well defined in the literature: studies such as [10, 4, 5], and many others, endeavor to maximize accuracy (achieve the lowest possible error value) and/or coverage (achieve the highest possible value) and view these metrics on a relative basis, i.e., how much the metric has increased or decreased beyond a baseline value based on empirical results. Furthermore, the interplay between accuracy and coverage, i.e., coverage decreases as a function of accuracy [4, 3], creates a trade-off for recommender system implementers that has been discussed previously but not developed thoroughly. Inspired by the suggestion in [6] to combine the coverage and accuracy measures to yield an overall "practical accuracy" measure for the recommender system, we developed a straightforward "AC Measure" that combines both accuracy and coverage into a single metric as follows:

    AC_i = \frac{Accuracy_i}{Coverage_i}    (5)

where i indicates the ith trial in an evaluation experiment.

Figure 1: Illustration of the AC Measure

The AC Measure simply adjusts (upward) the accuracy according to the level of coverage found in an experimental trial and is agnostic to the accuracy metric used, e.g., MAE or RMSE. Using a family of curves for the Mean Absolute Error (MAE) accuracy metric, Figure 1 illustrates the relationship between accuracy, coverage, and the AC Measure. As an example, following the "MAE: 0.5" curve we see that at 100% coverage the AC Measure is 0.5, and at 10% coverage the AC Measure has increased to 5. The intuition behind this metric is that when the recommender system is able to provide predictions for a high percentage of items in the dataset, the accuracy metric more closely indicates the level of system performance; conversely, when the coverage is low, the accuracy metric is "penalized" and is adjusted upwards. We believe that the major benefit of the AC Measure is that it formulates a solution for addressing the trade-off between accuracy and coverage and can be used to create a ranked list of results (low to high) from multiple experimental trials to find the best (lowest) AC Measure for each set of test conditions. The simplified visualization of the combined AC Measure shown in Figure 1 is an additional benefit. For our evaluation purposes, the use of a combined metric was ideal in addressing the inherent trade-offs between accuracy and coverage, especially in cases where accuracy appears good while coverage is low; we posit that the AC Measure will also be useful for other researchers performing evaluations using accuracy and coverage.
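To make the definition concrete, the following is a minimal sketch of the AC Measure and of ranking experimental trials by it; the accuracy value can be MAE or RMSE, coverage is a fraction in (0, 1], and the class and method names are ours.

    import java.util.Arrays;
    import java.util.Comparator;

    /** Sketch of the AC Measure (Equation 5): accuracy divided by coverage. */
    public final class AcMeasure {

      /** accuracy is an error metric such as MAE or RMSE; coverage is a fraction in (0, 1]. */
      public static double ac(double accuracy, double coverage) {
        if (coverage <= 0.0) {
          throw new IllegalArgumentException("coverage must be > 0");
        }
        return accuracy / coverage; // e.g. MAE 0.5 at 100% coverage -> 0.5; at 10% coverage -> 5.0
      }

      /** Rank trials from best (lowest AC) to worst; each trial is an {accuracy, coverage} pair. */
      public static double[][] rankTrials(double[][] trials) {
        double[][] ranked = trials.clone();
        Arrays.sort(ranked, new Comparator<double[]>() {
          public int compare(double[] a, double[] b) {
            return Double.compare(ac(a[0], a[1]), ac(b[0], b[1]));
          }
        });
        return ranked;
      }
    }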
5. EXPERIMENTAL DESIGN
The objective of this case study was to understand Mahout's baseline collaborative filtering algorithms and evaluate functional changes made to the platform using accuracy and coverage metrics. The main intent of making functional changes to Mahout recommender algorithms was to bring the Mahout algorithms in line with best practices found in the literature. Therefore, the overall hypothesis to be tested in this case study was that the modified algorithms improve Mahout's 'out-of-the-box' prediction accuracy for both user-based and item-based recommenders while maintaining reasonable coverage.

5.1 Datasets and Algorithms
The data used in this study were the MovieLens datasets downloaded from GroupLens Research⁸: the 100K dataset with 100,000 ratings for 1,682 movies and 943 users (referred to as ML100K in this study) and the 10M dataset with 10,000,000 ratings for 10,681 movies and 69,878 users (referred to as ML10M in this study). Ratings provided in these datasets consist of integer values from 1 (did not like) to 5 (liked very much).

For User-based recommenders (see §3.1), Mahout uses Pearson Correlation similarity (with and without similarity weighting), neighborhood formation (similarity thresholding or kNN), and Weighted Average prediction. This was tested against a modified algorithm (see §3.2) consisting of Pearson Correlation similarity (with and without similarity weighting), neighborhood formation (similarity thresholding or kNN), and mean-centered prediction. For Item-based recommenders (see §3.1), Mahout uses Pearson Correlation similarity (with and without similarity weighting), no neighborhood formation, and Weighted Average prediction. This was tested against a modified algorithm (see §3.2) consisting of Pearson Correlation similarity (with and without similarity weighting), neighborhood formation (similarity thresholding), and mean-centered prediction.

⁸ http://www.grouplens.org

5.1.1 Test Cases
In order to test the overall hypothesis, the following test cases were developed and executed for both user-based and item-based recommenders using the ML100K and ML10M datasets (the cross product of prediction method and weighting scheme is sketched in code below):

1. Mahout Prediction, No weighting
2. Mahout Prediction, Mahout weighted
3. Mahout Prediction, Significance weighted
4. Mean-Centered Prediction, No weighting
5. Mean-Centered Prediction, Mahout weighted
6. Mean-Centered Prediction, Significance weighted
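The six test cases are simply the 2 x 3 cross product of two prediction methods and three weighting schemes; a small illustrative sketch of enumerating them in a test harness follows (the enum and class names are ours, not Mahout's).

    /** The 2 x 3 grid of test cases: prediction method x similarity weighting. */
    public final class TestCases {
      enum Prediction { MAHOUT_WEIGHTED_AVERAGE, MEAN_CENTERED }
      enum Weighting { NONE, MAHOUT, SIGNIFICANCE }

      public static void main(String[] args) {
        int caseNumber = 1;
        for (Prediction p : Prediction.values()) {
          for (Weighting w : Weighting.values()) {
            System.out.println("Test case " + caseNumber++ + ": " + p + ", " + w);
          }
        }
      }
    }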
5.1.2 Accuracy and Coverage Metrics
We used Mahout's MAE evaluator to measure the accuracy of the rating predictions. For prediction coverage, we used dataset training data to estimate the rating predictions for the test set; the random sample of user-item pairs in our testing was 30K pairs for ML100K and 25K pairs for ML10M (see §3.2). AC Measures were calculated for all test cases.

5.1.3 Dataset Partitioning
The Mahout evaluator creates holdout⁹ partitions according to a set of run-time parameters. For the tests using the ML100K dataset, the training set was 70% of the data, the test set was 30% of the data, and 100% of the user data was used; a total of 30K rating predictions from 943 users were requested for each test set. For the tests using the ML10M dataset, the training set was 95% of the data, the test set was 5% of the data, and 5% of the user data was used; a total of 25K rating predictions from 3,180 users were requested for each test set.

⁹ Holdout is a method that splits a dataset into two parts, a training set and a test set; the partitioning is performed by randomly selecting some ratings from all, or some, of the users. The selected ratings constitute the test set, while the remaining ones form the training set.

5.1.4 Test Variations
Various similarity thresholds and kNN neighborhood sizes were tested for each test case in order to understand and evaluate the corresponding behavior of the recommenders. For User-based recommender testing, similarity thresholds of 0.0, 0.1, 0.3, 0.5, and 0.7 and kNN neighborhood sizes of 600, 400, 200, 100, 50, 20, 10, 5, and 2 were tested. For Item-based recommender testing, in addition to using no similarity thresholding, similarity thresholds of 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, and 0.7 were tested.
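Putting the pieces together, the sketch below shows how one such sweep could be parameterized; the trainingPct/evaluationPct values correspond to the 70%/100% (ML100K) and 95%/5% (ML10M) holdout settings above, and evaluateOnce is a hypothetical helper standing in for the Mahout wiring shown in Section 3.1 together with the coverage counters of Section 3.2.

    /** Sketch of the experimental sweep over similarity thresholds and kNN neighborhood sizes. */
    public final class ExperimentSweep {

      static final double[] SIM_THRESHOLDS = {0.0, 0.1, 0.3, 0.5, 0.7};
      static final int[] KNN_SIZES = {600, 400, 200, 100, 50, 20, 10, 5, 2};

      /** Simple holder for one trial's outcome. */
      static final class Trial {
        final double mae;
        final double coverage;
        Trial(double mae, double coverage) { this.mae = mae; this.coverage = coverage; }
      }

      public static void main(String[] args) {
        // ML100K holdout: 70% training, 30% test, 100% of users.
        // The ML10M runs would instead use trainingPct = 0.95 and evaluationPct = 0.05.
        double trainingPct = 0.7;
        double evaluationPct = 1.0;

        for (double threshold : SIM_THRESHOLDS) {
          for (int k : KNN_SIZES) {
            Trial t = evaluateOnce(threshold, k, trainingPct, evaluationPct);
            System.out.printf("threshold=%.1f k=%d MAE=%.3f coverage=%.3f AC=%.3f%n",
                threshold, k, t.mae, t.coverage, t.mae / t.coverage);
          }
        }
      }

      // Hypothetical helper: build the recommender for this configuration, run the MAE
      // evaluator, and return the measured MAE and coverage. Returns NaN placeholders here.
      static Trial evaluateOnce(double threshold, int k, double trainingPct, double evaluationPct) {
        return new Trial(Double.NaN, Double.NaN);
      }
    }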
6. RESULTS AND DISCUSSION

6.1 ML10M Results
Figures 2 and 3 show the results of test cases 1 through 6 for the user-based and item-based algorithms, respectively¹⁰. The key results of the experiment, for both user-based and item-based algorithms unless otherwise noted, were as follows:

1. MAE for mean-centered prediction with significance weighting is a significant improvement (p<0.01) over MAE for Mahout prediction, regardless of weighting, across similarity thresholds (except item-based at a similarity threshold of 0.7) and kNN neighborhood sizes (except user-based at kNN of 2, not shown).

2. Mahout similarity weighting does not significantly improve (p<0.01) Mahout prediction MAE over prediction with no similarity weighting (except Mahout prediction for user-based and item-based at a similarity threshold of 0.4, not shown). This would indicate that Mahout similarity weighting is not very effective as a weighting technique, especially as compared to significance weighting.

Figure 2: User-based Mahout Recommender Results for ML10M, Test cases 1 through 6

Figure 3: Item-based Mahout Recommender Results for ML10M, Test cases 1 through 6

¹⁰ The following curves are superimposed over each other because the values are very similar: MAE results for mean-centered prediction (No weighting and Mahout weighted), MAE results for Mahout prediction (No weighting and Mahout weighted), Coverage results for Mahout prediction and mean-centered prediction (No weighting and Mahout weighted), and Coverage results for Mahout prediction and mean-centered prediction (both Significance weighted).

6.2 ML100K Results
The results and trend lines for the ML100K experiment are similar to ML10M. The key results, for both user-based and item-based algorithms unless otherwise noted, were:

1. MAE for mean-centered prediction with significance weighting is a significant improvement (p<0.01) over MAE for Mahout prediction, regardless of weighting, across similarity thresholds and kNN neighborhood sizes (except user-based at kNN of 400).

2. Mahout similarity weighting does not significantly improve (p<0.01) Mahout prediction MAE over prediction with no similarity weighting (except Mahout prediction for user-based and item-based at a similarity threshold of 0.4).

6.3 Discussion
As hypothesized, results for both the ML100K and ML10M experiments show significant improvements in MAE using the mean-centered prediction algorithm with significance weighting compared to the Mahout baseline prediction algorithm. However, when coverage is considered, the "best" MAE results may need a second look. Can an MAE of 0.5 or less be considered "good" when the associated coverage is in the single digits? In this case, the recommender system may only be able to provide recommendations to a very small subset of its users, a situation that must be avoided by system operators. To help address the accuracy vs. coverage trade-off, combined measures such as the AC Measure (Section 4) can help by considering both accuracy and coverage simultaneously. For the ML10M experiment, we determined that the lowest MAE for the user-based algorithm using mean-centered prediction with significance weighting was 0.578 at a similarity threshold of 0.7 and coverage of 0.833%; the AC Measure for this result is calculated as 69.42. Similarly, the lowest MAE for the item-based algorithm using mean-centered prediction with significance weighting was 0.371 at a similarity threshold of 0.7 and coverage of 1.02%; the AC Measure for this result is calculated as 36.32. In each of these cases, the exceedingly high values for the AC Measure indicate that these results are not very desirable in a recommender system.
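As a worked example of the arithmetic behind these AC values (Equation 5 with coverage expressed as a fraction; the small differences from the reported 69.42 and 36.32 are due to rounding of the published MAE and coverage figures):

    AC_{user} = \frac{0.578}{0.00833} \approx 69.4, \qquad AC_{item} = \frac{0.371}{0.0102} \approx 36.4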
Figures 4 and 5 show the AC Measure results for user-based and item-based algorithms using ML10M, respectively. Rather than show all 30 results for each algorithm (5 similarity thresholds x 2 prediction methods x 3 weighting types), we show only the results with calculated AC Measure values less than 1.0; therefore, the lowest MAE results reported above for user-based and item-based algorithms are clearly beyond the range of this chart. The best combined accuracy/coverage results were found at higher levels of coverage and lower levels of similarity threshold, i.e., the best (lowest) AC Measure for user-based was 0.688 at a similarity threshold of 0.1 and for item-based was 0.665 at a similarity threshold of 0.0, both using mean-centered prediction and significance weighting. We can also see that, with few exceptions, mean-centered prediction is improved over the Mahout prediction for the same similarity weighting and similarity threshold. We observed similar results using the ML100K dataset, where the best (lowest) AC Measure for user-based was 0.765 and for item-based was 0.746, both at a similarity threshold of 0.0 and both using mean-centered prediction and significance weighting. These results demonstrate that the "best" MAE may not always be the lowest MAE, especially when coverage is also considered; furthermore, recommender system settings such as similarity weighting and neighborhood size also need to be considered during system evaluation.

Figure 4: AC Measure for selected User-based results (lower is better)

Figure 5: AC Measure for selected Item-based results (lower is better)

Other observations of our experiments that match results reported in [5], and that serve to validate our evaluation and increase our confidence in the results, are: (a) In general, significance weighting improves prediction MAE, as compared to predictions using Mahout similarity weighting or no similarity weighting; (b) As the similarity threshold increases, MAE for mean-centered prediction with significance weighting improves and coverage degrades, whereas MAE and coverage both degrade for Mahout prediction with Mahout weighting; (c) Coverage decreases as neighborhood size decreases.
7. CONCLUSION
Our case study of Mahout as a recommender system platform highlights evaluation considerations for developers and also shows how straightforward functional enhancements improve the performance of the baseline platform. We evaluated our changes against current Mahout functionality using accuracy and coverage metrics not only to assess baseline results, but also to provide a view of the trade-offs between accuracy and coverage resulting from using different recommender algorithms. We reported cases where the lowest MAE accuracy results were not necessarily always the 'best' when coverage results were also considered, and we instrumented Mahout for a combined accuracy and coverage metric (AC Measure) to evaluate these trade-offs more directly. We believe that this case study will provide useful guidance in using Mahout as a recommender platform, and that our combined measure will prove useful in evaluating algorithm changes for the inherent trade-offs between accuracy and coverage.

8. REFERENCES
[1] C. Desrosiers and G. Karypis. A comprehensive survey of neighborhood-based recommendation methods. In F. Ricci, L. Rokach, B. Shapira, and P. B. Kantor, editors, Recommender Systems Handbook. Springer, 2011.
[2] M. D. Ekstrand, M. Ludwig, J. A. Konstan, and J. T. Riedl. Rethinking the recommender research ecosystem: Reproducibility, openness, and LensKit. In Proceedings of the 5th ACM Recommender Systems Conference (RecSys '11), October 2011.
[3] M. Ge, C. Delgado-Battenfeld, and D. Jannach. Beyond accuracy: Evaluating recommender systems by coverage and serendipity. In Proceedings of the 4th ACM Recommender Systems Conference (RecSys '10), September 2010.
[4] N. Good, J. B. Schafer, J. A. Konstan, A. Borchers, B. Sarwar, J. Herlocker, and J. Riedl. Combining collaborative filtering with personal agents for better recommendations. In Proceedings of the 16th National Conference on Artificial Intelligence (AAAI-99), July 1999.
[5] J. L. Herlocker, J. A. Konstan, A. Borchers, and J. Riedl. An algorithmic framework for performing collaborative filtering. In Proceedings of the ACM SIGIR Conference, 1999.
[6] J. L. Herlocker, J. A. Konstan, L. G. Terveen, and J. Riedl. Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems, 22(1):5–53, 2004.
[7] S. McNee, J. Riedl, and J. Konstan. Being accurate is not enough: How accuracy metrics have hurt recommender systems. In Proceedings of the Conference on Human Factors in Computing Systems (CHI 2006), April 2006.
[8] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl. GroupLens: An open architecture for collaborative filtering of netnews. In Proceedings of the ACM CSCW Conference, 1994.
[9] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In Proceedings of the World Wide Web Conference, 2001.
[10] B. M. Sarwar, J. A. Konstan, A. Borchers, J. Herlocker, B. Miller, and J. Riedl. Using filtering agents to improve prediction quality in the GroupLens research collaborative filtering system. In Proceedings of the ACM 1998 Conference on Computer Supported Cooperative Work (CSCW '98), November 1998.
[11] C. E. Seminario and D. C. Wilson. Robustness and accuracy tradeoffs for recommender systems under attack. In Proceedings of the 25th Florida Artificial Intelligence Research Society Conference (FLAIRS-25), May 2012.
[12] G. Shani and A. Gunawardana. Evaluating recommendation systems. In F. Ricci, L. Rokach, B. Shapira, and P. B. Kantor, editors, Recommender Systems Handbook. Springer, 2011.