<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Leveraging Multi-Method Evaluation for Multi-Stakeholder Settings</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Christine Bauer</string-name>
          <email>christine.bauer@jku.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eva Zangerle</string-name>
          <email>eva.zangerle@uibk.ac.at</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Johannes Kepler University Linz</institution>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Innsbruck</institution>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <volume>19</volume>
      <issue>2019</issue>
      <abstract>
        <p>In this paper, we focus on recommendation settings with multiple stakeholders who may have varying goals and interests, and argue that a single evaluation method or measure cannot capture all relevant aspects of such a complex setting. We reason that employing a multi-method evaluation, where multiple evaluation methods or measures are combined and integrated, yields a richer picture and prevents blind spots in the evaluation outcome.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        In recommender systems (RS) research, we observe a strong focus
on advancing systems such that they accurately predict items that
an individual user may be interested in. The approach of evaluating
an RS is thereby largely focused on system-centric methods and
metrics (e.g., recall and precision in leave-n-out analyses [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]). By
employing such an evaluation approach and aiming at optimizing
these metrics, the following crucial components in the ecosystem
are neglected [
        <xref ref-type="bibr" rid="ref6 ref9">6, 9</xref>
        ]: (i) multiple stakeholders are embedded in the
ecosystem, but current research largely considers merely the role
of the end consumer; (ii) the stakeholders typically have diverging
interests and objectives for an RS; however, accurately predicting a
user’s interests is the predominant focus in current RS research; and
(iii) by taking a mainly accuracy-driven, system-centric approach
to evaluation, many aspects that determine a user’s experience with
an RS are not considered [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. This results in an incomplete picture
of user experience, leaving “blind spots” that are not captured in the
quality evaluation of an RS. Although studies [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] showed that a
lower accuracy rate may increase the business utility (e.g., revenue)
without any significant drop in user satisfaction, the objectives and
interests of stakeholders other than the user are typically not the
focus of research in academic settings in the RS community.
      </p>
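<p>To make the system-centric perspective concrete, the following minimal sketch (our own illustration, not taken from the cited works) computes precision@k and recall@k for a single user against a held-out item set, as used in leave-n-out analyses; all names and data are hypothetical.</p>
<preformat>
```python
# Illustrative sketch of system-centric accuracy metrics:
# precision@k and recall@k for one user in a leave-n-out split.

def precision_recall_at_k(recommended, held_out, k):
    """Count top-k hits against the user's held-out items."""
    top_k = recommended[:k]
    hits = len(set(top_k).intersection(held_out))
    precision = hits / k
    recall = hits / len(held_out) if held_out else 0.0
    return precision, recall

# Two of the three held-out items appear in the top-5 list.
p, r = precision_recall_at_k(["a", "b", "c", "d", "e"], ["b", "e", "f"], k=5)
# p = 0.4, r = 2/3
```
</preformat>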
      <p>In this paper, we call for considering the multiple stakeholders
in RS evaluation and postulate that only taking a multi-method
evaluation approach allows for capturing and assessing the various
interests, objectives, and experiences of these very stakeholders;
thus, contributing to eliminating the blind spots in RS evaluation.</p>
    </sec>
    <sec id="sec-2">
      <title>RELATED WORK</title>
      <p>
        The idea of combining different research methods is not a new one.
The concept of mixed methods research [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], for instance,
combines quantitative and qualitative research approaches. It has been
termed the third methodological paradigm, with quantitative and
qualitative methods representing the first and second paradigm
respectively [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. Yet, mixed methods research appears
to attract considerable interest but is rarely brought into practice [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
From a practical point of view, the reasons for the low adoption of
evaluations leveraging multiple methods are manifold, including
higher costs, higher complexity, and wider skill requirements compared
to adopting a single method [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        For RS research, Gunawardana and Shani [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] point out that
there is an extensive number of aspects that may be considered
when assessing the performance of a recommendation algorithm.
Indeed, early research on RS already pointed to the wide
variety of metrics available for system-centric RS evaluation,
including classification metrics, predictive metrics, coverage metrics,
confidence metrics, and learning rate metrics [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. As accuracy-driven evaluation has been shown not to capture all
the aspects that are relevant for user satisfaction [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], more user-relevant
metrics and measures have been introduced and considered
over time [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] (so-called “quality factors beyond accuracy” [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]).
This wider range of objectives includes qualities such as novelty,
serendipity [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], or diversity [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
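<p>As a sketch of such beyond-accuracy quality factors, the following illustration (our own; the pairwise-dissimilarity and self-information formulations are common choices, not prescribed by the cited works) computes intra-list diversity and mean novelty for a toy recommendation list.</p>
<preformat>
```python
import math

# Hypothetical sketch of two beyond-accuracy quality factors.

def intra_list_diversity(items, dissimilarity):
    """Average pairwise dissimilarity of a recommendation list."""
    pairs = [(a, b) for i, a in enumerate(items) for b in items[i + 1:]]
    return sum(dissimilarity(a, b) for a, b in pairs) / len(pairs)

def mean_novelty(items, popularity):
    """Mean self-information: rarer items score as more novel."""
    return sum(-math.log2(popularity[i]) for i in items) / len(items)

# Toy data: a genre-based 0/1 dissimilarity.
genre = {"x": "rock", "y": "rock", "z": "jazz"}
dissim = lambda a, b: 0.0 if genre[a] == genre[b] else 1.0
ild = intra_list_diversity(["x", "y", "z"], dissim)       # 2/3
nov = mean_novelty(["x", "y"], {"x": 0.5, "y": 0.25})     # 1.5
```
</preformat>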
      <p>
        Kohavi et al. [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] stress the importance of applying multiple
metrics also in the field of A/B testing and online experiments,
pointing out that different metrics reflect different concerns. For
A/B testing in RS research, Ekstrand and Willemsen [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] emphasize
the need to include methods and metrics that go beyond the typical
A/B behavior metrics. They argue that the currently dominating RS
evaluation based on implicit feedback and A/B testing (they refer to
this combination as “behaviorism”) is often very limited in its ability
to explain why users acted in a particular way. They emphasize that
experiments need to be thoroughly grounded in theory and point to
the advantages of collecting subjective responses from users which
may help to explain their behavior.
      </p>
      <p>
        Jannach and Adomavicius [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] point out that academic research
in the field of RS tends to focus on the consumer’s perspective
with the goal to maximize the consumer’s utility (measured in
terms of the most accurate items for a user), while maximizing the
provider’s utility (e.g., in terms of profit) appears to be neglected.
While industry research on RS will naturally build around the
provider’s perspective, publications in this area are scarce [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>DIGITAL MUSIC STAKEHOLDERS</title>
      <p>
        Various stakeholders are involved in the digital music value chain [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
These range from songwriters who create songs; to performers (e.g., (solo)
artists, bands); to music producers who take a broad role in the
production of a track; to record companies, including the three major
labels; to music platform providers with huge repositories of music
tracks, acting at the interface to music consumers; to hundreds
of millions of music consumers with different music preferences
and various objectives for using an RS (e.g., discovering previously
unknown items, rediscovering items they have not listened to in a
while); to society at large with its social, economic, and political
objectives and needs.
      </p>
      <p>
        Some stakeholders focus on user experience, where the goal is to
propose “the right music, to the right user, at the right moment” [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
Other stakeholders have business-oriented utility functions [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. For
instance, artists will most likely want to have their own songs
recommended to consumers. While some artists may be fine with
any of their songs being recommended, others may prefer to
increase the playcount of a particular song (e.g., to reach the top
charts, which would open an opportunity to draw an even broader
audience; or because some songs may generate higher revenues than
others due to contract rules). An additional 1,000 playcounts
will hardly be noticeable for highly popular artists with yearly
playcounts of several billion, but could be an important milestone for
a comparatively less popular (local) artist.
      </p>
    </sec>
    <sec id="sec-4">
      <title>BALANCING STAKEHOLDER INTERESTS IN EVALUATION</title>
      <p>In the following, we make the case for multi-method
evaluations that help identify the strengths and weaknesses of
a music RS for the stakeholders involved; in this section, we focus on the
users’ and artists’ perspectives.</p>
      <p>
        From a user perspective, recommendations that are adequate
in terms of system-centric measures—e.g., the predictive accuracy
of recommendation algorithms—do not necessarily meet a user’s
expectations [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. User-centric evaluation methods, in contrast,
involve users who interact with an RS [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] to gather user feedback [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
either implicitly or explicitly (depending on the concrete evaluation
design). Such methods measure a user’s perceived quality of the
RS at the time of recommendation, e.g., by established
questionnaires [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. Still, relying solely on user-centric methods does not
reveal the accuracy of the recommendations, because, given the
vast amount of items, users are not able to judge whether a given
recommendation was indeed the most relevant one [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        Measuring accuracy does not capture the recommendations’
usefulness for users, because higher accuracy scores do not necessarily
imply higher user satisfaction [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. For instance, a user’s
favorite song is an accurate prediction; still, repeating the same song
five times is, though accurate, likely not a satisfying experience.
Hence, we argue that for evaluating the user’s perspective of an
RS—the user being only one of the many stakeholders involved—
multiple evaluation methods and measures are required. This may
include combining a set of different measures (ranging from recall
and precision to serendipity, list diversity, or novelty) or integrating
different evaluation methods (ranging from leave-n-out offline
experiments to user studies and A/B testing). Furthermore, although
A/B testing based on users’ implicit feedback is effective for testing the
impact of different algorithms or designs on user behavior—and is,
thus, frequently considered the “gold standard” for recommender
evaluation—it is limited in explaining why users acted in
a particular way [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Additional information (e.g., users’ subjective
responses) is necessary to allow for explaining behavior.
      </p>
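<p>A multi-measure evaluation of this kind can be sketched as a simple scorecard that applies several measures to the same recommendation list instead of optimizing a single metric; the code below is our own illustration with invented data, assuming a toy catalog of ten items.</p>
<preformat>
```python
# Minimal sketch: register several measures and report them together.

def evaluate(recommended, held_out, metrics):
    """Apply every registered measure to the same recommendation list."""
    return {name: fn(recommended, held_out) for name, fn in metrics.items()}

metrics = {
    "recall@5": lambda rec, held: len(set(rec[:5]).intersection(held)) / len(held),
    "coverage": lambda rec, held: len(set(rec)) / 10,  # assumed catalog of 10 items
}
report = evaluate(["a", "b", "c"], ["b", "d"], metrics)
# {"recall@5": 0.5, "coverage": 0.3}
```
</preformat>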
      <p>
        In short, sticking to a single evaluation method narrows our view
of the RS, as if we wore blinders while devising and
evaluating it. We can borrow from the social and behavioral sciences, where,
e.g., mixed-methods research combines quantitative and qualitative
evaluations using different designs [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Creswell’s proposed designs
include—among others—the convergent parallel design and the
sequential design. In the convergent parallel design, two evaluation
methods are first applied in parallel, and finally integrated into a
single interpretation. The sequential design uses sequential timing,
employing the methods in distinct phases. The second phase of the
study, using the second method, is designed such that it follows
from the results of the first phase. Depending on the research goal
and the concrete choice of methods, researchers may either
interpret how the second phase’s results help to explain the initial results
(explanatory design) or they build on the exploratory results of the
first phase to subsequently employ a different method (in the second
phase) to test or generalize the initial findings (exploratory design).
For instance, Kamehkhosh and Jannach [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] showed that in the field
of music RS, the results of an offline evaluation could be
reproduced with online studies assessing the users’ perceived
quality of recommendations. Similarly, for the Recommender Systems
Challenge 2017, participants first evaluated their prototypes in
offline evaluations before deploying them and evaluating
them in the live system utilizing A/B tests [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], showing that many
of the RS that performed well in offline evaluations were able to repeat
this in online experiments. However, some of the devised RS
performed substantially worse in online experiments—highlighting
a shortcoming that was not revealed by evaluating solely from
an offline perspective. Along the same lines, Ekstrand and
Willemsen [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] state that utilizing behaviorism for evaluation purposes (e.g.,
through A/B tests) is not sufficient to understand why users act in a
particular way and, for instance, like a particular recommendation.
      </p>
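<p>A sequential design of this kind can be sketched as follows: an offline ranking of algorithms is checked against a later online ranking, here via a hand-rolled Kendall tau rank correlation. All algorithm names and scores are invented for illustration.</p>
<preformat>
```python
# Hypothetical sketch: does an offline algorithm ranking agree with
# a subsequent online ranking from a live deployment?

def kendall_tau(x, y):
    """Rank correlation between two paired score lists (no ties assumed)."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s != 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

offline = {"algA": 0.31, "algB": 0.28, "algC": 0.25}    # e.g., recall@10
online = {"algA": 0.041, "algB": 0.029, "algC": 0.035}  # e.g., click-through rate
algs = sorted(offline)
tau = kendall_tau([offline[a] for a in algs], [online[a] for a in algs])
# tau = 1/3: the offline ranking only partly predicts online behavior
```
</preformat>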
      <p>
        While academic research in the field of RS tends to focus on
maximizing the users’ utility, some authors (e.g., [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]) emphasize the
importance of profit (or value) maximization. Profit maximization
may not only be a goal for platform providers, but also for artists
who are the content providers for music platforms. From an artist’s
perspective, a good RS recommends the respective artist’s songs
sufficiently frequently, which may ultimately lead to playcounts,
likes, purchases, profit maximization, etc. Evaluating for profit
may, though, leave blind spots. For example, depending on the
chosen strategy, an artist may want to emphasize other values
such as expanding the audience (thus, reaching new listeners) or
increasing the listening or purchase volume within the current fan
base. Hence, metrics such as the number of unique listeners per artist,
the sum of playcounts over all songs of an artist, and metrics such as
profit per audience type may be valuable for RS optimization and
need to be considered in the RS evaluation strategy. Accordingly,
the artists’ goals and preferences need to be elicited and integrated
into the evaluation efforts. While evaluation on a per-artist basis might be interesting
for the individual artists (e.g., for a comparison between platforms
and their integrated RS), it may not be adequate for an overall
RS evaluation. Still, an RS needs to be evaluated for its ability to
serve the various strategies and for revealing potential tendencies
towards one strategy or another. As the targeted strategy might
correlate with artist characteristics (e.g., top-of-the-top artists vs.
“long tail” artists; early career vs. comeback phase vs. long-term
career; mainstream artists vs. niche genres), it might be in
society’s interest to evaluate for and ensure a balance.
      </p>
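<p>The artist-side metrics mentioned above can be sketched from a simple play log of (user, artist, song) events; the log format and function names are our own illustration, not an established API.</p>
<preformat>
```python
from collections import defaultdict

# Illustrative sketch of per-artist metrics from a play log.

def artist_metrics(plays):
    listeners = defaultdict(set)
    playcounts = defaultdict(int)
    for user, artist, song in plays:
        listeners[artist].add(user)   # unique listeners per artist
        playcounts[artist] += 1       # playcounts summed over all songs
    return {
        a: {"unique_listeners": len(listeners[a]), "playcount": playcounts[a]}
        for a in playcounts
    }

log = [("u1", "A", "s1"), ("u1", "A", "s2"), ("u2", "A", "s1"), ("u1", "B", "s3")]
# A: 2 unique listeners, 3 plays; B: 1 unique listener, 1 play
```
</preformat>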
      <p>Having given these examples, we emphasize that, due to
interdependencies between the RS and the various stakeholders’ actions,
the entire RS ecosystem has to be taken into account in the
evaluation. For instance, low accuracy of recommendations and low user
experience are not likely to continuously increase profits for the
platform provider and all kinds of artists; high accuracy does not
automatically imply high user experience and may not contribute
to profit maximization.
</p>
    </sec>
    <sec id="sec-6">
      <title>CONCLUSIONS</title>
      <p>In this position paper, we focused on the digital music
ecosystem as an example to illustrate that multiple stakeholders are impacted by
music RS, and discussed the opportunities of multi-method
evaluations to consider the multiple stakeholders’ perspectives. We
emphasize that—irrespective of the application domain—there are always
multiple stakeholders involved in recommendation settings. Hence,
there are always multiple—and possibly diverging—perspectives
and goals of these very stakeholders which need to be considered
in evaluating an RS. Consequently, multiple evaluation methods
and criteria have to be combined and possibly also weighted.</p>
      <p>
        Multi-method evaluations allow for gathering a richer and more
integrated picture of the quality of an RS and contribute to
understanding the various phenomena involved in a multi-stakeholder
setting, for which one method in isolation would be insufficient [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ].
      </p>
    </sec>
    <sec id="sec-7">
      <title>ACKNOWLEDGMENTS</title>
      <p>This research is supported by the Austrian Science Fund (FWF):
V579.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Himan</given-names>
            <surname>Abdollahpouri</surname>
          </string-name>
          and
          <string-name>
            <given-names>Steve</given-names>
            <surname>Essinger</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Multiple stakeholders in music recommender systems</article-title>
          .
          <source>In 1st International Workshop on Value-Aware and Multistakeholder Recommendation at RecSys</source>
          <year>2017</year>
          (
          <article-title>VAMS '17)</article-title>
          . arXiv:
          <volume>1708</volume>
          .
          <fpage>00120</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Fabian</given-names>
            <surname>Abel</surname>
          </string-name>
          , Yashar Deldjoo, Mehdi Elahi, and
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Kohlsdorf</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>RecSys Challenge 2017: Offline and Online Evaluation</article-title>
          .
          <source>In Proceedings of the Eleventh ACM Conference on Recommender Systems. ACM</source>
          , New York, NY, USA,
          <fpage>372</fpage>
          -
          <lpage>373</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Pär</surname>
            <given-names>J</given-names>
          </string-name>
          <string-name>
            <surname>Ågerfalk</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Embracing diversity through mixed methods research</article-title>
          .
          <source>European Journal of Information Systems</source>
          <volume>22</volume>
          ,
          <issue>3</issue>
          (
          <year>2013</year>
          ),
          <fpage>251</fpage>
          -
          <lpage>256</lpage>
          . https://doi.org/10.1057/ejis.2013.6
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Amos</given-names>
            <surname>Azaria</surname>
          </string-name>
          , Avinatan Hassidim, Sarit Kraus, Adi Eshkol, Ofer Weintraub, and
          <string-name>
            <given-names>Irit</given-names>
            <surname>Netanely</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Movie Recommender System for Profit Maximization</article-title>
          .
          <source>In Proceedings of the 7th ACM Conference on Recommender Systems (RecSys '13)</source>
          . ACM, New York, NY, USA,
          <fpage>121</fpage>
          -
          <lpage>128</lpage>
          . https://doi.org/10.1145/2507157.2507162
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Joeran</given-names>
            <surname>Beel</surname>
          </string-name>
          , Marcel Genzmehr, Stefan Langer, Andreas Nürnberger, and
          <string-name>
            <given-names>Bela</given-names>
            <surname>Gipp</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>A Comparative Analysis of Offline and Online Evaluations and Discussion of Research Paper Recommender System Evaluation</article-title>
          .
          <source>In Proceedings of the International Workshop on Reproducibility and Replication in Recommender Systems Evaluation (RepSys '13)</source>
          . ACM, New York, NY, USA,
          <fpage>7</fpage>
          -
          <lpage>14</lpage>
          . https://doi.org/10.1145/2532508.2532511
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Robin</given-names>
            <surname>Burke</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Multisided fairness for recommendation</article-title>
          .
          <source>In 4th Workshop on Fairness, Accountability, and Transparency in Machine Learning (FAT/ML '17)</source>
          . arXiv:
          <volume>1707</volume>
          .
          <fpage>00093</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Ilknur</given-names>
            <surname>Celik</surname>
          </string-name>
          , Ilaria Torre, Frosina Koceva, Christine Bauer, Eva Zangerle, and
          <string-name>
            <given-names>Bart</given-names>
            <surname>Knijnenburg</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>UMAP 2018 Intelligent User-Adapted Interfaces: Design and Multi-Modal Evaluation (IUadaptMe)</article-title>
          .
          <source>In Adjunct Publication of the 26th Conference on User Modeling, Adaptation and Personalization (UMAP '18)</source>
          . ACM, New York, NY, USA,
          <fpage>137</fpage>
          -
          <lpage>139</lpage>
          . https://doi.org/10.1145/3213586.3226202
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>John</surname>
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Creswell</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Research design: qualitative, quantitative, and mixed methods approaches (2nd ed</article-title>
          .).
          <source>Sage Publications</source>
          , Thousand Oaks, CA, USA.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Michael</surname>
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Ekstrand</surname>
            and
            <given-names>Martijn C.</given-names>
          </string-name>
          <string-name>
            <surname>Willemsen</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Behaviorism is not enough: better recommendations through listening to users</article-title>
          .
          <source>In Proceedings of the 10th ACM Conference on Recommender Systems (RecSys '16)</source>
          . ACM, New York, NY, USA,
          <fpage>221</fpage>
          -
          <lpage>224</lpage>
          . https://doi.org/10.1145/2959100.2959179
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Asela</given-names>
            <surname>Gunawardana</surname>
          </string-name>
          and
          <string-name>
            <given-names>Guy</given-names>
            <surname>Shani</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Evaluating Recommender Systems</article-title>
          .
          <source>In Recommender Systems Handbook (2nd ed.)</source>
          ,
          <string-name>
            <surname>Francesco</surname>
            <given-names>Ricci</given-names>
          </string-name>
          , Lior Rokach, and Bracha Shapira (Eds.). Springer, Boston, MA, USA,
          <fpage>265</fpage>
          -
          <lpage>308</lpage>
          . https://doi.org/10.1007/978-1-4899-7637-6_8
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Jonathan</surname>
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Herlocker</surname>
          </string-name>
          , Joseph A.
          <string-name>
            <surname>Konstan</surname>
          </string-name>
          , Loren G. Terveen, and John T. Riedl.
          <year>2004</year>
          .
          <article-title>Evaluating Collaborative Filtering Recommender Systems</article-title>
          .
          <source>ACM Transaction on Information Systems 22</source>
          ,
          <issue>1</issue>
          (Jan.
          <year>2004</year>
          ),
          <fpage>5</fpage>
          -
          <lpage>53</lpage>
          . https://doi.org/10.1145/963770.963772
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Dietmar</given-names>
            <surname>Jannach</surname>
          </string-name>
          and
          <string-name>
            <given-names>Gediminas</given-names>
            <surname>Adomavicius</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Price and profit awareness in recommender systems</article-title>
          .
          <source>In 1st International Workshop on Value-Aware and Multistakeholder Recommendation at RecSys</source>
          <year>2017</year>
          (
          <article-title>VAMS '17)</article-title>
          . arXiv:
          <volume>1707</volume>
          .
          <fpage>08029</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Dietmar</given-names>
            <surname>Jannach</surname>
          </string-name>
          , Paul Resnick, Alexander Tuzhilin, and
          <string-name>
            <given-names>Markus</given-names>
            <surname>Zanker</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <source>Recommender Systems - Beyond Matrix Completion. Commun. ACM</source>
          <volume>59</volume>
          ,
          <issue>11</issue>
          (
          <year>2016</year>
          ),
          <fpage>94</fpage>
          -
          <lpage>102</lpage>
          . https://doi.org/10.1145/2891406
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Iman</given-names>
            <surname>Kamehkhosh</surname>
          </string-name>
          and
          <string-name>
            <given-names>Dietmar</given-names>
            <surname>Jannach</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>User Perception of Next-Track Music Recommendations</article-title>
          .
          <source>In Proceedings of the 25th Conference on User Modeling, Adaptation and Personalization (UMAP '17)</source>
          . ACM, New York, NY, USA,
          <fpage>113</fpage>
          -
          <lpage>121</lpage>
          . https://doi.org/10.1145/3079628.3079668
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Marius</given-names>
            <surname>Kaminskas</surname>
          </string-name>
          and
          <string-name>
            <given-names>Derek</given-names>
            <surname>Bridge</surname>
          </string-name>
          .
          <year>2016</year>
          . Diversity, Serendipity, Novelty, and
          <article-title>Coverage: A Survey and Empirical Analysis of Beyond-Accuracy Objectives in Recommender Systems</article-title>
          .
          <source>ACM Transactions on Interactive Intelligent Systems 7, 1, Article</source>
          <volume>2</volume>
          (
          <year>2016</year>
          ),
          <volume>42</volume>
          pages. https://doi.org/10.1145/2926720
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Bart</surname>
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Knijnenburg</surname>
            and
            <given-names>Martijn C.</given-names>
          </string-name>
          <string-name>
            <surname>Willemsen</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Evaluating Recommender Systems with User Experiments</article-title>
          .
          <source>In Recommender Systems Handbook (2nd ed.)</source>
          ,
          <string-name>
            <given-names>Francesco</given-names>
            <surname>Ricci</surname>
          </string-name>
          , Lior Rokach, and Bracha Shapira (Eds.). Springer, Boston, MA, USA,
          <fpage>309</fpage>
          -
          <lpage>352</lpage>
          . https://doi.org/10.1007/978-1-4899-7637-6_9
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Ron</given-names>
            <surname>Kohavi</surname>
          </string-name>
          , Alex Deng, Brian Frasca, Toby Walker, Ya Xu, and
          <string-name>
            <given-names>Nils</given-names>
            <surname>Pohlmann</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Online Controlled Experiments at Large Scale</article-title>
          .
          <source>In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '13)</source>
          . ACM, New York, NY, USA,
          <fpage>1168</fpage>
          -
          <lpage>1176</lpage>
          . https://doi.org/10.1145/2487575.2488217
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Joseph A.</given-names>
            <surname>Konstan</surname>
          </string-name>
          and
          <string-name>
            <given-names>John</given-names>
            <surname>Riedl</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Recommender systems: from algorithms to user experience</article-title>
          .
          <source>User Modeling and User-Adapted Interaction 22</source>
          ,
          <issue>1</issue>
          (
          <year>2012</year>
          ),
          <fpage>101</fpage>
          -
          <lpage>123</lpage>
          . https://doi.org/10.1007/s11257-011-9112-x
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Audrey</given-names>
            <surname>Laplante</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Improving music recommender systems: what can we learn from research on music tags?</article-title>
          .
          <source>In 15th International Society for Music Information Retrieval Conference (ISMIR '14)</source>
          .
          <source>International Society for Music Information Retrieval</source>
          ,
          <fpage>451</fpage>
          -
          <lpage>456</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Sean M.</given-names>
            <surname>McNee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>John</given-names>
            <surname>Riedl</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Joseph A.</given-names>
            <surname>Konstan</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Being Accurate is Not Enough: How Accuracy Metrics Have Hurt Recommender Systems</article-title>
          .
          <source>In CHI '06 Extended Abstracts on Human Factors in Computing Systems (CHI EA '06)</source>
          . ACM, New York, NY, USA,
          <fpage>1097</fpage>
          -
          <lpage>1101</lpage>
          . https://doi.org/10.1145/1125451.1125659
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Pearl</given-names>
            <surname>Pu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Li</given-names>
            <surname>Chen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Rong</given-names>
            <surname>Hu</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>A User-centric Evaluation Framework for Recommender Systems</article-title>
          .
          <source>In Proceedings of the 5th ACM Conference on Recommender Systems (RecSys '11)</source>
          . ACM, New York, NY, USA,
          <fpage>157</fpage>
          -
          <lpage>164</lpage>
          . https://doi.org/10.1145/2043932.2043962
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Alan</given-names>
            <surname>Said</surname>
          </string-name>
          , Domonkos Tikk, Klara Stumpf, Yue Shi,
          <string-name>
            <given-names>Martha</given-names>
            <surname>Larson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Paolo</given-names>
            <surname>Cremonesi</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Recommender Systems Evaluation: A 3D Benchmark</article-title>
          .
          <source>In Proceedings of the Workshop on Recommendation Utility Evaluation: Beyond RMSE (RUE '12)</source>
          , Vol.
          <volume>910</volume>
          . CEUR Workshop Proceedings,
          <fpage>21</fpage>
          -
          <lpage>23</lpage>
          . http://ceur-ws.org/Vol-910/
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Charles</given-names>
            <surname>Teddlie</surname>
          </string-name>
          and
          <string-name>
            <given-names>Abbas</given-names>
            <surname>Tashakkori</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Foundations of mixed methods research: Integrating quantitative and qualitative approaches in the social and behavioral sciences</article-title>
          .
          <source>Sage Publications</source>
          , Thousand Oaks, CA, USA.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Viswanath</given-names>
            <surname>Venkatesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Susan A.</given-names>
            <surname>Brown</surname>
          </string-name>
          , and Hillol Bala.
          <year>2013</year>
          .
          <article-title>Bridging the qualitative-quantitative divide: Guidelines for conducting mixed methods research in information systems</article-title>
          .
          <source>MIS Quarterly 37</source>
          ,
          <issue>1</issue>
          (
          <year>2013</year>
          ),
          <fpage>21</fpage>
          -
          <lpage>54</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Markus</given-names>
            <surname>Zanker</surname>
          </string-name>
          , Laurens Rook, and
          <string-name>
            <given-names>Dietmar</given-names>
            <surname>Jannach</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Measuring the impact of online personalisation: Past, present and future</article-title>
          .
          <source>International Journal of Human-Computer Studies</source>
          (
          <year>2019</year>
          ). https://doi.org/10.1016/j.ijhcs.2019.06.006
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>