<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>EvalUMAP: Towards comparative evaluation in user modeling, adaptation and personalization</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Owen Conlan</string-name>
          <email>owen.conlan@scss.tcd.ie</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Liadh Kelly</string-name>
          <email>liadh.kelly@dcu.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kevin Koidl</string-name>
          <email>kevin.koidl@scss.tcd.ie</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Séamus Lawless</string-name>
          <email>seamus.lawless@scss.tcd.ie</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Athanasios Staikopoulos</string-name>
          <email>athanasios.staikopoulos@scss.tcd.ie</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ADAPT Centre, Dublin City University</institution>
          ,
          <country country="IE">Ireland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>ADAPT Centre, School of Computer Science and Statistics, Trinity College Dublin</institution>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>There is currently no established or standardized means for the comparative evaluation of algorithms and systems developed by researchers in the User Modeling, Adaptation and Personalization (UMAP) space. The design and establishment of such methodologies has proven to be extremely difficult, but would be highly rewarding, as demonstrated by initiatives such as CLEF, TREC and NTCIR in the Information Retrieval domain. Privacy concerns, the challenges of working with interactive scenarios, and individual differences in behaviour between users must all be addressed in order to facilitate repeatable and comparable evaluation, and to advance research in this domain. In this paper we present EvalUMAP, a new concerted drive towards the establishment of shared challenges for comparative evaluation within the UMAP community.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Research in the areas of User Modelling, Adaptation and Personalization (UMAP)
faces a number of significant scientific challenges. One of the most significant
is comparative evaluation. It has always been difficult
to rigorously compare different approaches to personalization, because the behaviour of
the resulting systems is, by its nature, heavily influenced by the behaviour
of the users involved in trialling them. To date this topic has received
relatively little attention when compared with other areas of Computer Science
research, such as Information Retrieval (IR). Developing comparative
evaluations in this space would be a major advance, as it would enable shared
comparison across research efforts, which to date has been very limited.</p>
      <p>One of the significant challenges in establishing such an initiative is that
the UMAP community encompasses a broad range of research areas and
technologies. An array of approaches to User Modeling exists, with no standardised
approach to data capture, analysis or representation. Personalised and adaptive
systems that utilise user models are also hugely varied in nature, from
personalised IR systems, which tailor the selection and ranking of content to an
individual, to more complex personalised learning systems, which tailor
interaction and learning offerings to an individual based upon their competency or
performance in learning tasks.</p>
      <p>Taking inspiration from communities such as IR and Machine Translation
(MT), the EvalUMAP Workshop series<fn><label>3</label><p>http://evalumap.adaptcentre.ie/</p></fn> was established in 2016 with the
ambitious goal of moving towards comparative evaluation in the UMAP community.
A specific first goal was set: to propose and design one or more shared tasks to
support the comparative evaluation of approaches to User Modelling,
Adaptation and Personalization. The long-term vision is the establishment of an annual
shared challenge series, similar to TREC<fn><label>4</label><p>http://trec.nist.gov/</p></fn> and CLEF<fn><label>5</label><p>http://clef2017.clef-initiative.eu/</p></fn> in the IR space. The
establishment of such shared tasks requires that appropriate models, content,
metadata, user behaviours, etc. be available, in order to comprehensively
compare how different approaches and systems perform. In addition, a number of
metrics and observations that participants would be
expected to report would need to be outlined in order to facilitate comparison. This is a significant undertaking. To
move towards this goal we, as a community, need to greatly advance our
understanding of, and the methodology associated with, UMAP evaluation. This includes
not only the technical challenges associated with design and implementation,
but also privacy, ethics, legal and security issues, evaluation methodologies and
metrics.</p>
      <p>When compared with shared tasks in IR, EvalUMAP aims to develop tasks
and test collections which are focused on variations in the user (represented by
variations in the underlying user model) and the personalised decisions taken by
the systems, rather than variations in the queries and/or relevance judgments
provided. The ultimate goal is to have the users who are being modeled involved
in judging the performance of the personalised systems, thereby contributing
to the iterative enhancement of the test collections used.</p>
      <p>In the next section we provide background for the EvalUMAP initiative and
then move on to provide an overview of progress to date towards this goal. We
conclude with a discussion of future directions.</p>
    </sec>
    <sec id="sec-2">
      <title>Background</title>
      <p>
        Despite research interest and progress in UMAP research, it is well
understood within the community that progress has been limited by a lack
of cross-comparable evaluation methods [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. This problem was highlighted
during the panel session at the UMAP 2015 conference, one outcome of which was
recognition of the need for a concerted community effort to develop cross-comparable evaluation
approaches for UMAP. The subsequent EvalUMAP 2016 workshop [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]
began concrete discussion on this topic.
      </p>
      <p>Currently, there are no established or standardized baselines or evaluation
metrics, and no commonly available test collections. Privacy concerns, the
challenges of working with interactive scenarios, and the individual differences in
behaviour between users must all be addressed in order to facilitate repeatable
and comparable evaluation and to advance research in this domain. While
overcoming these problems is a big challenge, there have been some notable efforts
in the past on which to build.</p>
      <p>
        Park et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], for example, propose a two-phase evaluation model
consisting of a qualitative pre-screening phase followed by quantitative user-based
assessments (using objective measures) to compare various content alternatives.
Weibelzahl and Weber [21] and Chin et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], on the other hand, focus more on the need
for system-wide empirical evaluations determining which users were helped or
hindered by user-adapted interactions. Van Velsen et al. [20] also attempt to
produce a model summarising the variables being assessed at each stage of the
process along with relevant methods to assess them.
      </p>
      <p>
        As more methods for evaluating the usefulness and accuracy of adaptive
systems appeared, the need to evaluate these evaluation methods became evident.
Van Velsen et al. [19], for example, evaluated three of the most common test
methods used to detect usability problems in personalised systems. More recently,
Paramythis et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] proposed to unify previous approaches presented in the
literature by introducing the Layered Evaluation framework. This approach seeks
to tackle the difficulties involved with evaluating adaptivity by decomposing such
systems into independent layers of adaptivity (such as User Model, Content
Model and Adaptive Decisions Logic). They also propose methods related to the
various development lifecycle stages of interactive systems. The main advantage of this layered
approach is that it enables a clear identification of issues within individual elements of the design.
However, it advocates evaluations that require a separation of concerns within the
design process that is not always possible.
      </p>
      <p>As can be seen, adaptive system evaluation has been a recurrent topic within
the community over the years. Nevertheless, a solution capable of delivering
repeatable and comparable results that would become the standard method to
evaluate UMAP research has yet to emerge. Improved solutions for UMAP
evaluation that are lower in cost, more repeatable, and more realistic are required.</p>
      <p>Lessons can be learned here from progress in shared
challenge generation in other domains. Arguably the nearest to our UMAP challenge is that of
the Information Retrieval (IR) community. This community has a long history
of shared challenge generation, with multiple established shared challenge
evaluation series running across the globe, namely TREC<fn><label>6</label><p>http://trec.nist.gov/</p></fn>, NTCIR<fn><label>7</label><p>http://research.nii.ac.jp/ntcir/index-en.html</p></fn>, FIRE<fn><label>8</label><p>http://fire.irsi.res.in/fire/</p></fn>, CLEF<fn><label>9</label><p>http://clef2017.clef-initiative.eu/</p></fn>
and most recently MediaEval<fn><label>10</label><p>http://www.multimediaeval.org/</p></fn>. The evaluation methodology adopted by these
shared challenges primarily involves the sharing of resources with participating
teams to perform a task, for example data retrieval or annotation. Each
participating team then uses their developed technique to perform the ad hoc task
using the provided data collection. Performance in the task is scored against
an organizer-provided gold standard.</p>
      <p>
        These challenges traditionally considered the one-off requirements of a
typical or standard user of a system. In recent years the community has started to
look more closely at bringing the user into the loop, exploring the creation of
shared challenges that consider iterative search sessions (for example in
initiatives such as [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]), providing profiles of individual users to aid search (for example,
the new PIR-CLEF task<fn><label>11</label><p>http://www.ir.disco.unimib.it/pirclef2017/</p></fn>) and providing access to real users conducting real
search tasks [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ][
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Most recently, the Personalized Information Retrieval (PIR)
task at CLEF 2017 introduces user profiles to personalize the retrieval process<fn><label>12</label><p>http://www.ir.disco.unimib.it/pirclef2017/</p></fn>.
      </p>
      <p>In working towards the possibility of shared challenges in the UMAP
community we can learn from such initiatives. However, the types of algorithms and
systems which the UMAP community seeks to evaluate are of a distinct nature,
and as such will require their own unique solution.</p>
    </sec>
    <sec id="sec-3">
      <title>Outcomes of EvalUMAP 2016</title>
      <p>The 1st International EvalUMAP Workshop (2016) brought together researchers
from across the User Modelling, Adaptation and Personalization community to
explore potential new ideas and approaches to support comparative evaluations.
This area of research is inherently complex, not only because of
its focus on the user, but also because of the diverse domains
involved. To date this has presented significant barriers to comparing research
outcomes. In particular, the workshop investigated how
the UMAP research community can facilitate the shared development of
evaluation tasks and competitions. Ten position and
discussion papers were accepted, covering different evaluation aspects, from
potential frameworks and platforms to requirements and reference models, as well
as specific evaluation areas and metrics.</p>
      <p>
        More specifically, the contributions of the papers were as follows. Koidl et al.
[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] argued that researchers conducting evaluations in the fields of User
Modelling and Personalisation lack continuing evaluation
feedback from, and collaboration with, the wider research community. As a result,
the authors proposed a community-driven portal, the ECP
(Evaluation Community Portal), specifically focused on evaluations within the UMAP
community. ECP is inspired by work from the Cross-Language Evaluation
Forum (CLEF)<fn><label>13</label><p>http://www.clef-initiative.eu/</p></fn>, and is based on the simplicity of Calls for Papers (CFP). As
a starting point, the authors proposed the following key features: a) the ability to
post calls for participation in evaluations, b) the ability to discuss approaches and
findings in a forum manner, and c) the ability to upload and present data that can be
shared and used in other evaluations. Furthermore, an initial task force of
community champions needs to be identified to lead these efforts, bootstrap
the portal and provide the initial community momentum.
      </p>
      <p>
        Vahid et al. [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] presented how related tasks, involving the
gathering of user profiles and objects of their interest, were designed in well-known IR evaluation
communities. The paper reviews and compares existing user tasks and describes
task resource collection, methods and metrics. In particular, the paper describes
two tasks: a) the Contextual Suggestion Task of TREC (Text REtrieval
Conference) and b) the Social Book Search Task of CLEF. The goal of the Contextual
Suggestion Task is to evaluate search techniques for complex information
needs of users with respect to context and their points of interest. This task
investigates the development of systems that are able to suggest
sites to explore in an unknown city, based upon the user’s personal
interests in the user’s home city. A set of user preferences, example suggestions
and a set of contexts are given to participants as inputs. As evaluation metrics,
Precision at Rank 5 (P@5), Mean Reciprocal Rank (MRR) and a modified
version of Time-Biased Gain (TBG) are used to rank participants’ runs. The Social
Book Search Task investigates evaluation methodologies for a book search task
using a combination of various aspects of retrieval and recommendation dealing
with professional and user-generated metadata. A set of book requests and a
set of user profiles are taken as inputs to the task, and a submitted
ranked list of recommended books is evaluated as the output of each
participant’s system. The official evaluation measure for this task is nDCG@10. It
takes graded relevance values into account and is designed for evaluation based
on the top retrieved results. In addition, P@10, MAP and MRR scores are also
reported with the evaluation results.
      </p>
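      <p>To make the rank-based measures named above concrete, the following is a minimal Python sketch of P@k, MRR and nDCG@k. It is illustrative only and is not the official TREC or CLEF evaluation code. The function names are hypothetical, relevance is assumed to be given as a list of judgments in ranked order (binary for P@k and MRR, graded for nDCG), nDCG is assumed to use linear gain with a log2 discount, and the ideal ranking is assumed to be a re-sorting of the judged list itself. Time-Biased Gain and MAP are omitted.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
from typing import Sequence


def precision_at_k(relevance: Sequence[int], k: int) -> float:
    """Fraction of the top-k ranked items that are relevant (label > 0)."""
    return sum(1 for r in relevance[:k] if r > 0) / k


def reciprocal_rank(relevance: Sequence[int]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is relevant."""
    for rank, r in enumerate(relevance, start=1):
        if r > 0:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Sequence[Sequence[int]]) -> float:
    """Average reciprocal rank over several ranked lists (one per request)."""
    return sum(reciprocal_rank(r) for r in runs) / len(runs)


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain: graded gain divided by log2(rank + 1)."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))


def ndcg_at_k(gains: Sequence[float], k: int) -> float:
    """DCG normalised by the DCG of the ideal (descending) ordering.

    Assumption: the ideal ordering is built from the judged list itself.
    """
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0
```
]]></preformat>
      <p>Under these assumptions, a run’s P@5 would be obtained as precision_at_k(judgments, 5) and the Social Book Search measure as ndcg_at_k(gains, 10). Other nDCG variants, for example with exponential gain, would give different absolute scores.</p>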
      <p>
        Next, Pandit et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] emphasised the need to support the reproducibility
of results in a systematic way. Reproducibility of results is a key element for the
verification of scientific experiments and an important indicator of the quality of
a published experiment. Therefore, it is vital to precisely and transparently share
both the method and the data associated with an experiment. In particular, in
their paper the authors explore how emerging linked data standards such as the
P-PLAN, CSVW and DCAT ontologies can be applied to the description of the
steps and data associated with a published adaptive or personalised experiment
in a manner that can be easily located, linked, accessed and reused to repeat an
experiment.
      </p>
      <p>
        Kelly [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] proposed using the Living Labs methodology, an emerging
evaluation paradigm in IR and recommender systems that provides a platform for
supporting shared evaluation tasks. In this case, Living Labs would be adapted to
allow for shared evaluation tasks in the UMAP community and, specifically, to
support the individual requirements and differences, privacy concerns and the
interactive nature of the space. In general, Living Labs hold great promise for
conducting realistic evaluations with real users in natural task environments and,
more importantly, for allowing cross-comparability (e.g. by providing a
benchmarking platform and performing rankings) across research centres.
      </p>
      <p>
        Adaji and Vassileva [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] introduced Persuasive Systems Design (PSD) as a
framework for developing and evaluating persuasive systems. Despite its
extensive use as a guide for developing persuasive systems, its use as an evaluation tool
for such systems has yet to be exploited. The PSD framework comprises
persuasive principles that could be used to develop and implement strategies
that encourage personalisation and adaptation to user preferences. Using
Netflix as a case study, the authors identified the implementation of the persuasive
principles and the design of system features. This study can act as a guide for
the development of evaluation metrics for persuasion-related shared tasks.
      </p>
      <p>
        Bogina and Kuflik [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] pointed out that the evaluation of User Adaptive Systems (UAS) does not
intersect with software evaluation as commonly defined in the Software
Engineering domain. As a result, the authors suggested adopting common software
engineering practices and changing the community’s practice and methods by
making software testing an integral part of the shared task evaluation
process. This would result in tasks and data that are more easily reproducible and reusable
by other members of the community.
      </p>
      <p>
        Next, De Bra [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] notes that there is a strong focus on comparative
evaluation in the research field of User Modeling, Adaptation and Personalization. In
particular, the author discusses when it is reasonable to
perform such evaluation on adaptive systems and applications. For adaptive
systems, the author argues that these types of comparisons have not gained much
acceptance as “evaluation” within the UMAP community. For applications,
the author argues that it is difficult to perform a meaningful evaluation because
it is hard to find something to compare the (use of the) application with. In
both cases, having a common reference model and applying layered approaches,
among others, may help the community get started and allow different systems
and applications to be compared.
      </p>
      <p>
        Staikopoulos and Conlan [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] focused on evaluating user-adaptive systems
and pointed out that this remains a challenging research issue and a difficult task,
because of the lack of widely accepted evaluation methods and data, and
the difficulty of generalizing across application areas (e.g. learning, information
systems, business). In order to move towards comparative evaluations of
user-adaptive systems and to ensure a scientific process, the authors identified some
vital requirements. More specifically, they proposed developing a flexible
common reference model and related metrics upon which user-adaptive systems
and approaches could be evaluated and compared against each other, both
comprehensively (as a whole) and upon specific adaptation layers and aspects.
The layers should encompass different aspects of a user-adaptive system, such as
a) inferring user/group properties, b) identifying the user environment and
context (e.g. location, affect), c) evaluating personalised content retrieval, d)
evaluating the underlying decision-making mechanisms, strategies and algorithms,
e) evaluating the adaptation of content, navigation, presentation or user
feedback and support, f) the user interaction and experience (e.g. evaluating usability,
satisfaction), and g) the system efficiency (e.g. scalability, responsiveness).
      </p>
      <p>In addition, Yousuf and Conlan [22] proposed evaluating the usage of visual
narratives as a way to indicate positive behavioural change in student
engagement levels. In particular, their paper describes the VisEN framework, which provides
visual narratives to students in order to motivate them about their level of engagement
with their course.</p>
      <p>
        Finally, based on the established relationship between language
usage and an author’s personality, Chin and Wright [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] proposed using specific
evaluation metrics and statistical tests for inferring (predicting user features) and
benchmarking a user’s personality from text. Such guidelines and metrics can then
be used to design and evaluate related evaluation tasks. To do so, the authors
reported the need for established corpora with detailed reporting
requirements that will allow researchers to easily compare their algorithms for
inferring personality from text. However, there will always be a need to extend
the corpora to increase the coverage of different types of writing, time periods,
and localities. As a result, the authors recommended having a series of corpora,
with, perhaps, one added every few years to keep providing new data to the
community.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Discussion</title>
      <p>Developing means to conduct shared evaluation in the user modelling,
adaptation and personalization (UMAP) space is inherently difficult, not least because
of privacy concerns, individual differences in behaviour between the users of
systems and challenges associated with working in interactive scenarios. A
further challenge is the underlying value of such datasets. Datasets which detail
the actions or interests of authentic users are viewed as valuable and, in many
cases, proprietary. This compounds the challenge of accessing this data. However,
without this access, the only remaining way for researchers to obtain
user data is via systems that were created within a research environment,
which can lead to low user numbers and data that is not up to date.</p>
      <p>Overcoming these challenges will require greatly advancing our
understanding of, and the methodology associated with, UMAP evaluation. Challenges
include:</p>
      <list list-type="bullet">
        <list-item><p>Defining tasks and scenarios for evaluation purposes</p></list-item>
        <list-item><p>Identifying potential corpora for shared tasks</p></list-item>
        <list-item><p>Identifying interesting target tasks and explaining their importance</p></list-item>
        <list-item><p>Combining existing evaluation metrics and methods</p></list-item>
        <list-item><p>Improving on previously suggested metrics and methods</p></list-item>
        <list-item><p>Proposing new evaluation metrics and methods</p></list-item>
        <list-item><p>Investigating anonymization or decentralisation of user data to ensure
proprietary value is not compromised</p></list-item>
        <list-item><p>Exploring potential partnerships with companies which hold user data to
discuss how research can be conducted without the risk of privacy or value
loss</p></list-item>
      </list>
      <p>The papers at the 1st EvalUMAP workshop (2016) covered both challenges
and potential solutions associated with this area. It is envisaged that future
workshops will build on these outcomes, creating a forum to present innovative
datasets and shared challenges that use these datasets to evaluate systems. The
aim of this year’s EvalUMAP workshop (2017) is to start scoping and designing
one or more shared tasks. The resulting shared tasks are to be accompanied by
appropriate models, content, metadata, user behaviours, etc., so that they can be used to
comprehensively compare how different approaches and systems perform. In
addition, a number of metrics and observations that participants
will be expected to report will be outlined to facilitate comparison.</p>
      <p>To create a community around shared tasks for user modelling it is envisaged
that the following aspects have to be addressed: (1) a clear understanding of the
challenges and requirements related to the design of a shared task approach in
the UMAP space; (2) the identification of suitable, publicly accessible datasets;
and (3) an initial description for shared task evaluations using these identified
and suitable datasets.</p>
      <p>Establishing shared tasks that cover the many facets of a full
personalisation system is challenging. With that in mind, the plan is to escalate the tasks
year-on-year, starting with a user modelling challenge, then layering in some
indicative personalisation decision making processes based on changes in user
models, before identifying a mechanism to incorporate real users into the shared
task. This escalation is necessary as the user is a complex element of the
personalisation process; their actions heavily influence changes to the user model
and subsequent personalisation decisions. In order to effectively compare
different systems and approaches it is necessary to incorporate users in a meaningful
and replicable manner. This of course presents an overhead in running shared
tasks. All that being said, our plan is to run the first shared task in 2017-2018
as a static user modelling challenge based on historic social media data from the
users and explicitly captured information about their expertise. The ADAPT
Centre has committed to support these shared tasks.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>The EvalUMAP workshop series considers the strengths and limitations of
existing work in UMAP evaluation, while moving towards its ambitious goal of
designing and establishing a forum for comparative evaluation in the UMAP
space. The long term vision of EvalUMAP is the establishment of an annual
shared challenge series. The workshop this year will focus on identifying
candidate datasets that meet specific requirements (e.g. ownership, accessibility) and
that could form the basis for designing shared task challenges and evaluations
for the academic year 2017-18, which will be presented at an EvalUMAP 2018
forum. It is not intended that this is the only form of scholarly advancement, but
through the shared tasks published and managed by the EvalUMAP Workshop
a common baseline for comparison may be established.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>The ADAPT Centre for Digital Content Technology is funded under the SFI
Research Centres Programme (Grant 13/RC/2106) and is co-funded under the
European Regional Development Fund.</p>
      <p>19. Van Velsen, L., van der Geest, T., Klaassen, R.: Identifying usability issues for
personalization during formative evaluations: A comparison of three methods.
International Journal of Human-Computer Interaction 27(7), 670–698 (2011)</p>
      <p>20. Van Velsen, L., van der Geest, T., Klaassen, R., Steehouder, M.: User-centered
evaluation of adaptive and adaptable systems: a literature review. The Knowledge
Engineering Review 23(3), 261–281 (2008)</p>
      <p>21. Weibelzahl, S., Weber, G.: Advantages, opportunities, and limits of empirical
evaluations: Evaluating adaptive systems. Künstliche Intelligenz 3, 17–20 (2002)</p>
      <p>22. Yousuf, B., Conlan, O.: Motivating behavioral change through personalized visual
narratives. EvalUMAP 16, UMAP 2016 Extended Proceedings 1618 (2016)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Adaji</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vassileva</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Evaluating persuasive systems using the PSD framework</article-title>
          .
          <source>EvalUMAP 16, UMAP 2016 Extended Proceedings</source>
          <volume>1618</volume>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Balog</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kelly</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schuth</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Head first: Living labs for ad-hoc search evaluation</article-title>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bogina</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuflik</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Building a bridge between user-adaptive systems evaluation and software testing</article-title>
          .
          <source>EvalUMAP 16, UMAP 2016 Extended Proceedings</source>
          <volume>1618</volume>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>De Bra</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Evaluating adaptive systems and applications is often nonsense</article-title>
          .
          <source>EvalUMAP 16, UMAP 2016 Extended Proceedings</source>
          <volume>1618</volume>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Chin</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chin</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Empirical evaluation of user models and user-adapted systems</article-title>
          .
          <source>User Modeling and User-Adapted Interaction</source>
          <volume>11</volume>
          (
          <issue>1</issue>
          ),
          <fpage>327</fpage>
          -
          <lpage>337</lpage>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Chin</surname>
            ,
            <given-names>D.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wright</surname>
            ,
            <given-names>W.R.</given-names>
          </string-name>
          :
          <article-title>Evaluation metrics for inferring personality from text</article-title>
          .
          <source>EvalUMAP 16, UMAP 2016 Extended Proceedings</source>
          <volume>1618</volume>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Conlan</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kelly</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koidl</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lawless</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Levacher</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Staikopoulos</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>EvalUMAP2016: Towards comparative evaluation in the user modelling, adaptation and personalization space workshop</article-title>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Hopfgartner</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kille</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lommatzsch</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Plumbaum</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brodt</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heintz</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Benchmarking news recommendations in a living lab</article-title>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>G.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soboroff</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>TREC 2016 dynamic domain track overview</article-title>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Höök</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Steps to take before intelligent user interfaces become real</article-title>
          .
          <source>Interacting with Computers</source>
          <volume>12</volume>
          (
          <issue>4</issue>
          ),
          <fpage>409</fpage>
          -
          <lpage>426</lpage>
          (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Kelly</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Living labs for UMAP evaluation</article-title>
          .
          <source>EvalUMAP 16, UMAP 2016 Extended Proceedings</source>
          <volume>1618</volume>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Koidl</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Levacher</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Conlan</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Steichen</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>ECP: Evaluation community portal, a portal for evaluation and collaboration in user modelling and personalisation research</article-title>
          .
          <source>EvalUMAP 16, UMAP 2016 Extended Proceedings</source>
          <volume>1618</volume>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Hernández del Olmo</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gaudioso</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Evaluation of recommender systems: A new approach</article-title>
          .
          <source>Expert Systems with Applications</source>
          <volume>35</volume>
          (
          <issue>3</issue>
          ),
          <fpage>790</fpage>
          -
          <lpage>804</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Pandit</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hamed</surname>
            ,
            <given-names>R.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lawless</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lewis</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>The use of open data to improve the repeatability of adaptivity and personalisation experiment</article-title>
          .
          <source>EvalUMAP 16, UMAP 2016 Extended Proceedings</source>
          <volume>1618</volume>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Paramythis</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weibelzahl</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Masthoff</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Layered evaluation of interactive adaptive systems: framework and formative methods</article-title>
          .
          <source>User Modeling and User-Adapted Interaction</source>
          <volume>20</volume>
          (
          <issue>5</issue>
          ),
          <fpage>383</fpage>
          -
          <lpage>453</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Park</surname>
            ,
            <given-names>K.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lim</surname>
            ,
            <given-names>C.H.</given-names>
          </string-name>
          :
          <article-title>A structured methodology for comparative evaluation of user interface designs using usability criteria and measures</article-title>
          .
          <source>International Journal of Industrial Ergonomics</source>
          <volume>23</volume>
          (
          <issue>5-6</issue>
          ),
          <fpage>379</fpage>
          -
          <lpage>389</lpage>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Staikopoulos</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Conlan</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Towards comparative evaluations of user-adaptive software systems</article-title>
          .
          <source>EvalUMAP 16, UMAP 2016 Extended Proceedings</source>
          <volume>1618</volume>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Vahid</surname>
            ,
            <given-names>A.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hamed</surname>
            ,
            <given-names>R.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koidl</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>A review of user-centred information retrieval tasks</article-title>
          .
          <source>EvalUMAP 16, UMAP 2016 Extended Proceedings</source>
          <volume>1618</volume>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>