Case Study Evaluation of Mahout as a Recommender Platform

Carlos E. Seminario
Software and Information Systems Dept.
University of North Carolina Charlotte
cseminar@uncc.edu

David C. Wilson
Software and Information Systems Dept.
University of North Carolina Charlotte
davils@uncc.edu

ABSTRACT

Various libraries have been released to support the development of recommender systems for some time, but it is only relatively recently that larger scale, open-source platforms have become readily available. In the context of such platforms, evaluation tools are important both to verify and validate baseline platform functionality, as well as to provide support for testing new techniques and approaches developed on top of the platform. We have adopted Apache Mahout as an enabling platform for our research and have faced both of these issues in employing it as part of our work in collaborative filtering. This paper presents a case study of evaluation focusing on accuracy and coverage evaluation metrics in Apache Mahout, a recent platform that provides support for recommender system application development. As part of this case study, we developed a new metric combining accuracy and coverage in order to evaluate functional changes made to Mahout's collaborative filtering algorithms.

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval–Information filtering

General Terms

Algorithms, Experimentation, Measurement

Keywords

Recommender systems, Evaluation, Mahout
1. INTRODUCTION

Selecting a foundational platform is an important step in developing recommender systems for personal, research, or commercial purposes. This can be done in many different ways: the platform may be developed from the ground up, an existing recommender engine may be contracted (e.g., OracleAS Personalization, http://download.oracle.com/docs/cd/B10464_05/bi.904/b12102/1intro.htm), code libraries can be adapted, or a platform may be selected and tailored to suit (e.g., LensKit, http://lenskit.grouplens.org/; MymediaLite, http://www.ismll.uni-hildesheim.de/mymedialite/; Apache Mahout, http://mahout.apache.org). In some cases, a combination of these approaches will be employed. For many projects, and particularly in the research context, the ideal situation is to find an open-source platform with many active contributors that provides a rich and varied set of recommender system functions and meets all or most of the baseline development requirements. Short of finding this ideal solution, minor customization of an existing system may be the best approach to meet the specific development requirements.

Various libraries have been released to support the development of recommender systems for some time, but it is only relatively recently that larger scale, open-source platforms have become readily available. In the context of such platforms, evaluation tools are important both to verify and validate baseline platform functionality and to support testing of new techniques and approaches developed on top of the platform. We have adopted Apache Mahout as an enabling platform for our research and have faced both of these issues in employing it as part of our work in collaborative filtering recommenders.

This paper presents a case study of evaluation for recommender systems in Apache Mahout, focusing on metrics for accuracy and coverage. We have developed functional changes to the baseline Mahout collaborative filtering algorithms to meet our research purposes, and this paper examines evaluation both from the standpoint of tools for baseline platform functionality and from the standpoint of enhancements and new functionality. The objective of this case study is to evaluate these functional changes made to the platform by comparing the baseline collaborative filtering algorithms to the changed algorithms using well-known measures of accuracy and coverage [6]. Our goal is not to validate algorithms that have already been tested previously, but to assess whether, and to what extent, the functional enhancements have improved the accuracy and coverage performance of the baseline out-of-the-box Mahout platform. Given the interplay between accuracy and coverage in this context, we developed a unified metric to assess accuracy vs. coverage trade-offs when evaluating functional changes made to Mahout's collaborative filtering algorithms.

2. RELATED WORK

Revisiting evaluation in the context of recommender platforms has received recent attention in the thorough evaluation of the LensKit platform using previously tested collaborative filtering algorithms and metrics, as reported in [2]. A comprehensive set of guidelines for evaluating recommender systems was provided by Herlocker et al. [6]; these guidelines highlight the use of evaluation metrics such as accuracy and coverage and suggest the need for an ideal "general coverage metric" that would combine coverage with accuracy to yield an overall "practical accuracy" measure. Many of these evaluation metrics and techniques have also been covered recently in [12].

Recommender system research has been primarily concerned with improving recommendation accuracy [7]; however, other metrics such as coverage [10, 4], as well as novelty and serendipity [6, 3], have been deemed necessary because accuracy alone is not sufficient to properly evaluate a system. McNee et al. [7] state that recommendations that are most accurate according to the standard metrics are sometimes not the most useful to users, and they outline a more user-centric approach to evaluation. The interplay between accuracy and other metrics such as coverage and serendipity creates trade-offs for recommender system implementers; this has been widely discussed in the literature, e.g., in [4, 3] and in our previous work on trade-offs between accuracy and robustness [11].

3. SELECTING APACHE MAHOUT

To support our research in collaborative filtering, several recommender system platforms were surveyed, including LensKit, easyrec (http://easyrec.org/), and MymediaLite. We selected Mahout because it provides many of the characteristics required for a recommender development workbench platform. Mahout is a production-level, open-source system that offers a wide range of capabilities useful to a recommender system developer: collaborative filtering algorithms, data clustering, and data classification. Mahout is also highly scalable and can support distributed processing of large data sets across clusters of computers using Hadoop (http://hadoop.apache.org/). Mahout recommenders support various similarity and neighborhood formation calculations; its recommendation prediction algorithms include user-based, item-based, Slope-One, and Singular Value Decomposition (SVD); and it incorporates Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) evaluation methods. Mahout is readily extensible and provides a wide range of Java classes for customization. As an open-source project, the Mahout developer/contributor community is very active; the Mahout wiki also provides a list of developers and a list of websites that have implemented Mahout (https://cwiki.apache.org/MAHOUT/mahout-wiki.html).
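To give a concrete sense of the components listed above, the following sketch wires together a basic user-based recommender and requests a prediction and a top-N list using the Mahout 0.x Taste API. The class names are those we recall from that API; the data file path, user/item IDs, and neighborhood size are illustrative assumptions rather than settings from this study.

    // Minimal sketch of a user-based recommender on the Mahout 0.x Taste API.
    // File path, IDs, and neighborhood size are illustrative only.
    import java.io.File;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class UserBasedExample {
      public static void main(String[] args) throws Exception {
        // Ratings file in Mahout's CSV format: userID,itemID,rating
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        // k-nearest-neighbor neighborhood; a similarity threshold could be used instead
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(50, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Predicted rating for a single user-item pair
        float prediction = recommender.estimatePreference(1L, 100L);
        System.out.println("Predicted rating: " + prediction);

        // Top-5 recommendations for the same user
        for (RecommendedItem item : recommender.recommend(1L, 5)) {
          System.out.println(item.getItemID() + " -> " + item.getValue());
        }
      }
    }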
3.1 Uncovering Mahout Details

Although Mahout is rich in documentation, there are implementation details of how Mahout works that could only be understood by reading the source code. Thus, for clarity in evaluation, we needed to verify the implementation of baseline platform functionality. The following describes some of these details for Mahout 0.4 'out-of-the-box':

Similarity Weighting: Mahout implements the classic Pearson Correlation as described in [8, 5]. Similarity weighting is supported in Mahout and consists of the following method:

    scaleFactor = 1.0 - count / (num + 1);
    if (result < 0.0)
        result = -1.0 + scaleFactor * (1.0 + result);
    else
        result = 1.0 - scaleFactor * (1.0 - result);

where count is the number of co-rated items between two users, num is the number of items in the dataset, and result is the calculated Pearson Correlation coefficient.

User-Based Prediction Algorithm: Mahout implements a Weighted Average prediction method similar to the approach described in [1], except that Mahout does not take the absolute value of the individual similarities in the denominator; it does, however, ensure that the predicted ratings are within the allowable range, e.g., between 1.0 and 5.0.

Item-Based Prediction Algorithm: Mahout implements a Weighted Average prediction method. This approach is similar to the algorithm in [9], except that Mahout does not take the absolute value of the individual similarities in the denominator; it does, however, ensure that the predicted ratings are within the allowable range, e.g., between 1.0 and 5.0. Also, Mahout does not provide support for neighborhood formation, e.g., similarity thresholding, for item-based prediction.

Accuracy Evaluation Calculation: Mahout executes the recommender system evaluator specified at run time (MAE or RMSE) and implements traditional techniques found in [6, 12]. For MAE, this would be

    MAE = \frac{\sum_{i=1}^{n} | ActualRating_i - PredictedRating_i |}{n}        (1)

where n is the total number of ratings predicted in the test run.
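To make the baseline behaviors above concrete, the following plain-Java sketch shows a weighted-average prediction that, like the baseline, sums raw similarities in the denominator and clamps the result to the rating range, together with an MAE computation per Equation (1). This is our reading of the behavior described above, not Mahout source code, and the method and parameter names are ours.

    // Plain-Java sketch of the baseline behaviors described in Section 3.1;
    // our reading of the described behavior, not Mahout source code.
    public class BaselineSketch {

      // Weighted-average prediction over a neighborhood: the denominator sums the
      // raw similarities (no absolute value), and the result is clamped to the
      // allowable rating range, e.g., [1.0, 5.0].
      static double weightedAveragePrediction(double[] similarities, double[] ratings,
                                              double minRating, double maxRating) {
        double numerator = 0.0;
        double denominator = 0.0;
        for (int v = 0; v < similarities.length; v++) {
          numerator += similarities[v] * ratings[v];
          denominator += similarities[v];   // no Math.abs() here
        }
        double prediction = numerator / denominator;
        return Math.max(minRating, Math.min(maxRating, prediction));
      }

      // Mean Absolute Error per Equation (1).
      static double mae(double[] actualRatings, double[] predictedRatings) {
        double sumOfAbsoluteErrors = 0.0;
        for (int i = 0; i < actualRatings.length; i++) {
          sumOfAbsoluteErrors += Math.abs(actualRatings[i] - predictedRatings[i]);
        }
        return sumOfAbsoluteErrors / actualRatings.length;
      }
    }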
3.2 Making Mahout Fit for Purpose

Through personal email communication with one of the Mahout developers, we were informed that Mahout intended to provide basic rating prediction and similarity weighting capabilities for its recommenders and that it would be up to developers to provide more elaborate approaches. Several changes were made to the prediction algorithms and the similarity weighting techniques for both the user-based and item-based recommenders in order to meet our specific requirements and to match the best practices found in the literature, as follows:

Similarity weighting: Defined as Significance Weighting in [5], this consists of the following method:

    scaleFactor = count / 50.0;
    if (scaleFactor > 1.0)
        scaleFactor = 1.0;
    result = scaleFactor * result;

where count is the number of co-rated items between two users, and result is the calculated Pearson Correlation coefficient.

User-user mean-centered prediction: After identifying a neighborhood of similar users, a prediction, as documented in [8, 5, 1], is computed for a target item i and target user u as follows:

    p_{u,i} = \bar{r}_u + \frac{\sum_{v \in V} sim_{u,v} (r_{v,i} - \bar{r}_v)}{\sum_{v \in V} | sim_{u,v} |}        (2)

where V is the set of k similar users who have rated item i, r_{v,i} is the rating of those users for item i, \bar{r}_u is the average rating of the target user u over all rated items, \bar{r}_v is the average rating of user v over all co-rated items, and sim_{u,v} is the Pearson correlation coefficient.

Item-item mean-centered prediction: A prediction, as documented in [1], is computed for a target item i and target user u as follows:

    p_{u,i} = \bar{r}_i + \frac{\sum_{j \in N_u(i)} sim_{i,j} (r_{u,j} - \bar{r}_j)}{\sum_{j \in N_u(i)} | sim_{i,j} |}        (3)

where N_u(i) is the set of items rated by user u that are most similar to item i, r_{u,j} is u's rating of item j, \bar{r}_j is the average rating for item j over all users who rated item j, \bar{r}_i is the average rating for the target item i, and sim_{i,j} is the Pearson correlation coefficient.

Item-item similarity thresholding: This method was added to Mahout and used in conjunction with the item-item mean-centered prediction described above. Similarity thresholding, as described in [5], defines a level of similarity that is required for two items to be considered similar for purposes of making a recommendation prediction; item-item similarities below the threshold are not used in the prediction calculation.
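To illustrate how these modifications fit together, the following standalone sketch combines significance weighting, similarity thresholding, and user-user mean-centered prediction per Equation (2). It is a simplified illustration of the approach described above, not the actual code we added to Mahout; the method names and array-based inputs are ours.

    // Standalone sketch of the modified user-user pipeline in Section 3.2:
    // significance weighting, similarity thresholding, and mean-centered
    // prediction per Equation (2). Not the actual code added to Mahout.
    public class ModifiedUserUserSketch {

      // Significance weighting [5]: scale the Pearson correlation down when the
      // number of co-rated items is below 50.
      static double significanceWeighted(double pearson, int coRatedCount) {
        double scaleFactor = Math.min(coRatedCount / 50.0, 1.0);
        return scaleFactor * pearson;
      }

      // Mean-centered prediction for target user u and item i over a neighborhood.
      // Each neighbor v contributes its (weighted) similarity sim(u,v), its rating
      // r_{v,i}, and its mean rating; neighbors whose similarity falls below the
      // threshold are excluded, mirroring similarity thresholding.
      static Double meanCenteredPrediction(double targetUserMean,
                                           double[] similarities,     // sim(u,v), already weighted
                                           double[] neighborRatings,  // r_{v,i}
                                           double[] neighborMeans,    // mean rating of each v
                                           double similarityThreshold) {
        double numerator = 0.0;
        double denominator = 0.0;
        for (int v = 0; v < similarities.length; v++) {
          if (similarities[v] < similarityThreshold) {
            continue;                                // thresholding: skip weak neighbors
          }
          numerator += similarities[v] * (neighborRatings[v] - neighborMeans[v]);
          denominator += Math.abs(similarities[v]);  // absolute value per Equation (2)
        }
        if (denominator == 0.0) {
          return null;  // no usable neighbors: no prediction is made, which lowers coverage
        }
        return targetUserMean + numerator / denominator;
      }
    }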
Coverage and combined accuracy/coverage metric: As suggested in [6], the easiest way to measure coverage is to select a random sample of user-item pairs, ask for a prediction for each pair, and measure the percentage for which a prediction was provided. To calculate coverage, code changes were made to Mahout to provide, for each test run, the total number of rating predictions requested that could not be calculated as well as the total number of rating predictions requested that were actually calculated; the sum of these two numbers is the total number of ratings requested. Coverage was calculated as follows:

    Coverage = \frac{Total\#RatingsCalculated}{Total\#RatingsRequested}        (4)

Code changes were also made to calculate a combined accuracy and coverage metric, as defined in Section 4.

4. ACCURACY AND COVERAGE METRIC

The metrics selected for this case study, accuracy and coverage, were chosen because they are fundamental to the utility of a recommender system [10, 6]. Although other metrics such as novelty and serendipity can, and should, be used in conjunction with accuracy and coverage, our objective was to evaluate the very basic requirements of a recommender system. Our implementation of coverage, referred to as prediction coverage in [6], measures the percentage of a dataset for which the recommender system is able to provide predictions. High coverage indicates that the recommender system is able to provide predictions for a large number of items and is considered a desirable characteristic of a recommender system [6]. A combination of high accuracy (low error rate) and high coverage is desirable to users and system operators alike because it improves the utility, or usefulness, of the system from a user standpoint [10, 6].

What constitutes 'good' accuracy or coverage, however, has not been well defined in the literature: studies such as [10, 4, 5], and many others, endeavor to maximize accuracy (achieve the lowest possible error value) and/or coverage (achieve the highest possible value) and view these metrics on a relative basis, i.e., how much the metric has increased or decreased beyond a baseline value based on empirical results. Furthermore, the interplay between accuracy and coverage, i.e., coverage decreases as a function of accuracy [4, 3], creates a trade-off for recommender system implementers that has been discussed previously but not developed thoroughly. Inspired by the suggestion in [6] to combine the coverage and accuracy measures to yield an overall "practical accuracy" measure for the recommender system, we developed a straightforward "AC Measure" that combines both accuracy and coverage into a single metric as follows:

    AC_i = \frac{Accuracy_i}{Coverage_i}        (5)

where i indicates the i-th trial in an evaluation experiment.

[Figure 1: Illustration of the AC Measure]

The AC Measure simply adjusts the accuracy upward according to the level of coverage found in an experimental trial and is agnostic to the accuracy metric used, e.g., MAE or RMSE. Using a family of curves for the Mean Absolute Error (MAE) accuracy metric, Figure 1 illustrates the relationship between accuracy, coverage, and the AC Measure. As an example, following the "MAE: 0.5" curve, we see that at 100% coverage the AC Measure is 0.5, while at 10% coverage the AC Measure has increased to 5. The intuition behind this metric is that when the recommender system is able to provide predictions for a high percentage of items in the dataset, the accuracy metric closely indicates the level of system performance; conversely, when coverage is low, the accuracy metric is "penalized" and adjusted upward. We believe that the major benefit of the AC Measure is that it formulates a solution for addressing the trade-off between accuracy and coverage and can be used to create a ranked list of results (low to high) from multiple experimental trials to find the best (lowest) AC Measure for each set of test conditions. The simplified visualization of the combined AC Measure shown in Figure 1 is an additional benefit. For our evaluation purposes, the use of a combined metric was ideal in addressing the inherent trade-offs between accuracy and coverage, especially in cases where accuracy is found to be high while coverage is low; we posit that the AC Measure will also be useful to other researchers performing evaluations using accuracy and coverage.
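The computations behind Equations (4) and (5), and the low-to-high ranking of trials, can be summarized in a few lines. The sketch below uses an illustrative Trial class (not a structure from our code) and reproduces the Figure 1 example of an MAE of 0.5 at 100% and at 10% coverage.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    // Sketch of the coverage and AC Measure computations (Equations 4 and 5) and
    // the low-to-high ranking of trials; the Trial class is illustrative only.
    public class AcMeasureSketch {

      static class Trial {
        final String description;
        final double mae;            // accuracy for this trial (MAE here; RMSE also works)
        final int ratingsRequested;  // predictions calculated plus those that could not be
        final int ratingsCalculated;

        Trial(String description, double mae, int ratingsRequested, int ratingsCalculated) {
          this.description = description;
          this.mae = mae;
          this.ratingsRequested = ratingsRequested;
          this.ratingsCalculated = ratingsCalculated;
        }

        double coverage() {          // Equation (4)
          return (double) ratingsCalculated / ratingsRequested;
        }

        double acMeasure() {         // Equation (5): accuracy adjusted upward by low coverage
          return mae / coverage();
        }
      }

      public static void main(String[] args) {
        List<Trial> trials = new ArrayList<>();
        trials.add(new Trial("MAE 0.50 at 100% coverage", 0.50, 30000, 30000)); // AC = 0.5
        trials.add(new Trial("MAE 0.50 at 10% coverage", 0.50, 30000, 3000));   // AC = 5.0

        // Rank trials from best (lowest AC Measure) to worst.
        trials.sort(Comparator.comparingDouble(Trial::acMeasure));
        for (Trial t : trials) {
          System.out.printf("%s -> AC Measure %.2f%n", t.description, t.acMeasure());
        }
      }
    }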
5. EXPERIMENTAL DESIGN

The objective of this case study was to understand Mahout's baseline collaborative filtering algorithms and to evaluate functional changes made to the platform using accuracy and coverage metrics. The main intent of making functional changes to the Mahout recommender algorithms was to bring them in line with best practices found in the literature. Therefore, the overall hypothesis to be tested in this case study was that the modified algorithms improve Mahout's 'out-of-the-box' prediction accuracy for both user-based and item-based recommenders while maintaining reasonable coverage.

5.1 Datasets and Algorithms

The data used in this study were the MovieLens datasets downloaded from GroupLens Research (http://www.grouplens.org): the 100K dataset with 100,000 ratings for 1,682 movies and 943 users (referred to as ML100K in this study) and the 10M dataset with 10,000,000 ratings for 10,681 movies and 69,878 users (referred to as ML10M in this study). Ratings in these datasets are integer values from 1 (did not like) to 5 (liked very much).

For the user-based recommender (see §3.1), Mahout uses Pearson Correlation similarity (with and without similarity weighting), neighborhood formation (similarity thresholding or kNN), and Weighted Average prediction. This was tested against a modified algorithm (see §3.2) consisting of Pearson Correlation similarity (with and without similarity weighting), neighborhood formation (similarity thresholding or kNN), and mean-centered prediction. For the item-based recommender (see §3.1), Mahout uses Pearson Correlation similarity (with and without similarity weighting), no neighborhood formation, and Weighted Average prediction. This was tested against a modified algorithm (see §3.2) consisting of Pearson Correlation similarity (with and without similarity weighting), neighborhood formation (similarity thresholding), and mean-centered prediction.

5.1.1 Test Cases

In order to test the overall hypothesis, the following test cases were developed and executed for both user-based and item-based recommenders using the ML100K and ML10M datasets:

1. Mahout Prediction, No weighting
2. Mahout Prediction, Mahout weighted
3. Mahout Prediction, Significance weighted
4. Mean-Centered Prediction, No weighting
5. Mean-Centered Prediction, Mahout weighted
6. Mean-Centered Prediction, Significance weighted

5.1.2 Accuracy and Coverage Metrics

We used Mahout's MAE evaluator to measure the accuracy of the rating predictions. For prediction coverage, we used the dataset training data to estimate the rating predictions for the test set; the random sample of user-item pairs in our testing was 30K pairs for ML100K and 25K pairs for ML10M (see §3.2). AC Measures were calculated for all test cases.

5.1.3 Dataset Partitioning

The Mahout evaluator creates holdout partitions according to a set of run-time parameters. (Holdout is a method that splits a dataset into a training set and a test set; the partitioning is performed by randomly selecting some ratings from all, or some, of the users. The selected ratings constitute the test set, while the remaining ones form the training set.) For the tests using the ML100K dataset, the training set was 70% of the data, the test set was 30% of the data, and 100% of the user data was used; a total of 30K rating predictions from 943 users were requested for each test set. For the tests using the ML10M dataset, the training set was 95% of the data, the test set was 5% of the data, and 5% of the user data was used; a total of 25K rating predictions from 3,180 users were requested for each test set.

5.1.4 Test Variations

Various similarity thresholds and kNN neighborhood sizes were executed for each test case in order to understand and evaluate the corresponding behavior of the recommenders. For user-based recommender testing, similarity thresholds of 0.0, 0.1, 0.3, 0.5, and 0.7 and kNN neighborhood sizes of 600, 400, 200, 100, 50, 20, 10, 5, and 2 were tested. For item-based recommender testing, in addition to using no similarity thresholding, similarity thresholds of 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, and 0.7 were tested.
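For reference, the sketch below shows how holdout splits like those in §5.1.3 are typically passed to Mahout's MAE evaluator in the 0.x Taste API, with the training and evaluation fractions mirroring the ML100K (70/30, all users) and ML10M (95/5, 5% of users) configurations described above. This is our assumption of a typical setup, not the exact harness used in this study, and the RecommenderBuilder is supplied by the caller.

    // Sketch of configuring Mahout's MAE evaluator for the holdout splits
    // described in Section 5.1.3 (Mahout 0.x Taste API); an assumed setup,
    // not the exact harness used in this study.
    import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
    import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
    import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
    import org.apache.mahout.cf.taste.model.DataModel;

    public class EvaluationSketch {

      static double runMae(RecommenderBuilder builder, DataModel model,
                           double trainingFraction, double userFraction) throws Exception {
        // AverageAbsoluteDifferenceRecommenderEvaluator computes MAE; an RMSE
        // evaluator can be substituted when RMSE is specified at run time.
        RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
        // trainingFraction: share of each selected user's ratings used for training;
        // userFraction: share of users included in the evaluation.
        return evaluator.evaluate(builder, null, model, trainingFraction, userFraction);
      }

      static void runBothSplits(RecommenderBuilder builder, DataModel ml100k, DataModel ml10m)
          throws Exception {
        double maeMl100k = runMae(builder, ml100k, 0.70, 1.0);  // 70/30 split, 100% of users
        double maeMl10m = runMae(builder, ml10m, 0.95, 0.05);   // 95/5 split, 5% of users
        System.out.println("ML100K MAE: " + maeMl100k + ", ML10M MAE: " + maeMl10m);
      }
    }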
6. RESULTS AND DISCUSSION

6.1 ML10M Results

Figures 2 and 3 show the results of test cases 1 through 6 for the user-based and item-based algorithms, respectively. (Several curves are superimposed over each other because the values are very similar: the MAE results for mean-centered prediction with no weighting and with Mahout weighting, the MAE results for Mahout prediction with no weighting and with Mahout weighting, the coverage results for Mahout prediction and mean-centered prediction with no weighting and with Mahout weighting, and the coverage results for Mahout prediction and mean-centered prediction, both significance weighted.)

[Figure 2: User-based Mahout Recommender Results for ML10M, Test cases 1 through 6]

[Figure 3: Item-based Mahout Recommender Results for ML10M, Test cases 1 through 6]

The key results of the experiment, for both user-based and item-based algorithms unless otherwise noted, were as follows:

1. MAE for mean-centered prediction with significance weighting is a significant improvement (p<0.01) over MAE for Mahout prediction, regardless of weighting, across similarity thresholds (except item-based at a similarity threshold of 0.7) and kNN neighborhood sizes (except user-based at a kNN of 2, not shown).

2. Mahout similarity weighting does not significantly improve (p<0.01) Mahout prediction MAE over prediction with no similarity weighting (except Mahout prediction for user-based and item-based at a similarity threshold of 0.4, not shown). This indicates that Mahout similarity weighting is not very effective as a weighting technique, especially as compared to significance weighting.

6.2 ML100K Results

The results and trend lines for the ML100K experiment are similar to those for ML10M. The key results, for both user-based and item-based algorithms unless otherwise noted, were:

1. MAE for mean-centered prediction with significance weighting is a significant improvement (p<0.01) over MAE for Mahout prediction, regardless of weighting, across similarity thresholds and kNN neighborhood sizes (except user-based at a kNN of 400).

2. Mahout similarity weighting does not significantly improve (p<0.01) Mahout prediction MAE over prediction with no similarity weighting (except Mahout prediction for user-based and item-based at a similarity threshold of 0.4).
6.3 Discussion

As hypothesized, the results for both the ML100K and ML10M experiments show significant improvements in MAE using the mean-centered prediction algorithm with significance weighting compared to the Mahout baseline prediction algorithm. However, when coverage is considered, the "best" MAE results may need a second look. Can an MAE of 0.5 or less be considered "good" when the associated coverage is in the single digits? In that case, the recommender system may only be able to provide recommendations to a very small subset of its users, a situation that system operators must avoid. To help address the accuracy vs. coverage trade-off, combined measures such as the AC Measure (Section 4) can help by considering both accuracy and coverage simultaneously. For the ML10M experiment, we determined that the lowest MAE for the user-based algorithm using mean-centered prediction with significance weighting was 0.578, at a similarity threshold of 0.7 and coverage of 0.833%; the AC Measure for this result is calculated as 69.42. Similarly, the lowest MAE for the item-based algorithm using mean-centered prediction with significance weighting was 0.371, at a similarity threshold of 0.7 and coverage of 1.02%; the AC Measure for this result is calculated as 36.32. In each of these cases, the exceedingly high values of the AC Measure indicate that these results are not very desirable in a recommender system.

Figures 4 and 5 show the AC Measure results for the user-based and item-based algorithms using ML10M, respectively. Rather than show all 30 results for each algorithm (5 similarity thresholds x 2 prediction methods x 3 weighting types), we show only the results with calculated AC Measure values less than 1.0; the lowest MAE results reported above for the user-based and item-based algorithms are therefore clearly beyond the range of these charts. We found that the best combined accuracy/coverage results occurred at higher levels of coverage and lower similarity thresholds, i.e., the best (lowest) AC Measure for user-based was 0.688 at a similarity threshold of 0.1 and for item-based was 0.665 at a similarity threshold of 0.0, both using mean-centered prediction and significance weighting. We can also see that, with few exceptions, mean-centered prediction improves over Mahout prediction for the same similarity weighting and similarity threshold. We observed similar results using the ML100K dataset, where the best (lowest) AC Measure for user-based was 0.765 and for item-based was 0.746, both at a similarity threshold of 0.0 and both using mean-centered prediction and significance weighting. These results demonstrate that the "best" MAE may not always be the lowest MAE, especially when coverage is also considered; furthermore, recommender system settings such as similarity weighting and neighborhood size also need to be considered during system evaluation.

[Figure 4: AC Measure for selected User-based results (lower is better)]

[Figure 5: AC Measure for selected Item-based results (lower is better)]

Other observations from our experiments that match results reported in [5], and that serve to validate our evaluation and increase our confidence in the results, are: (a) in general, significance weighting improves prediction MAE compared to predictions using Mahout similarity weighting or no similarity weighting; (b) as the similarity threshold increases, MAE for mean-centered prediction with significance weighting improves and coverage degrades, whereas MAE and coverage both degrade for Mahout prediction with Mahout weighting; and (c) coverage decreases as neighborhood size decreases.
7. CONCLUSION

Our case study of Mahout as a recommender system platform highlights evaluation considerations for developers and also shows how straightforward functional enhancements improve the performance of the baseline platform. We evaluated our changes against current Mahout functionality using accuracy and coverage metrics, not only to assess baseline results, but also to provide a view of the trade-offs between accuracy and coverage that result from using different recommender algorithms. We reported cases where the lowest MAE accuracy results were not necessarily the 'best' when coverage results were also considered, and we instrumented Mahout with a combined accuracy and coverage metric (the AC Measure) to evaluate these trade-offs more directly. We believe that this case study will provide useful guidance in using Mahout as a recommender platform, and that our combined measure will prove useful in evaluating algorithm changes for the inherent trade-offs between accuracy and coverage.

8. REFERENCES

[1] C. Desrosiers and G. Karypis. A comprehensive survey of neighborhood-based recommendation methods. In F. Ricci, L. Rokach, B. Shapira, and P. B. Kantor, editors, Recommender Systems Handbook. Springer, 2011.
[2] M. D. Ekstrand, M. Ludwig, J. A. Konstan, and J. T. Riedl. Rethinking the recommender research ecosystem: Reproducibility, openness, and LensKit. In Proceedings of the 5th ACM Recommender Systems Conference (RecSys '11), October 2011.
[3] M. Ge, C. Delgado-Battenfeld, and D. Jannach. Beyond accuracy: Evaluating recommender systems by coverage and serendipity. In Proceedings of the 4th ACM Recommender Systems Conference (RecSys '10), September 2010.
[4] N. Good, J. B. Schafer, J. A. Konstan, A. Borchers, B. Sarwar, J. Herlocker, and J. Riedl. Combining collaborative filtering with personal agents for better recommendations. In Proceedings of the 16th National Conference on Artificial Intelligence (AAAI-99), July 1999.
[5] J. L. Herlocker, J. A. Konstan, A. Borchers, and J. Riedl. An algorithmic framework for performing collaborative filtering. In Proceedings of the ACM SIGIR Conference, 1999.
[6] J. L. Herlocker, J. A. Konstan, L. G. Terveen, and J. Riedl. Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems, 22(1):5–53, 2004.
[7] S. McNee, J. Riedl, and J. Konstan. Accurate is not always good: How accuracy metrics have hurt recommender systems. In Proceedings of the Conference on Human Factors in Computing Systems (CHI 2006), April 2006.
[8] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl. GroupLens: An open architecture for collaborative filtering of netnews. In Proceedings of the ACM CSCW Conference, 1994.
[9] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In Proceedings of the World Wide Web Conference, 2001.
[10] B. M. Sarwar, J. A. Konstan, A. Borchers, J. Herlocker, B. Miller, and J. Riedl. Using filtering agents to improve prediction quality in the GroupLens research collaborative filtering system. In Proceedings of the ACM 1998 Conference on Computer Supported Cooperative Work (CSCW '98), November 1998.
[11] C. E. Seminario and D. C. Wilson. Robustness and accuracy tradeoffs for recommender systems under attack. In Proceedings of the 25th Florida Artificial Intelligence Research Society Conference (FLAIRS-25), May 2012.
[12] G. Shani and A. Gunawardana. Evaluating recommendation systems. In F. Ricci, L. Rokach, B. Shapira, and P. B. Kantor, editors, Recommender Systems Handbook. Springer, 2011.