<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluating Adaptive Systems and Applications is often Nonsense</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>CCS Concepts</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Human-centered computing</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Paul De Bra, Dept. of Math. and Computer Science, Eindhoven University of Technology (TU/e), Eindhoven</institution>
          ,
          <country country="NL">the Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In the research field of User Modeling, Adaptation and Personalization there is a strong focus on comparative evaluation. In this discussion paper we ask ourselves when it makes sense to perform such an evaluation and when it is complete nonsense. We argue that especially for adaptive systems (as opposed to applications) the typical comparative evaluations with groups of end-users make no sense. We also argue that for applications it is difficult to perform a meaningful evaluation because it is hard to find something to compare the (use of the) application with.</p>
      </abstract>
      <kwd-group>
        <kwd>usability</kwd>
        <kwd>comparative evaluation</kwd>
        <kwd>systems versus applications</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Adaptive (web-based) hypermedia [
        <xref ref-type="bibr" rid="ref2 ref8">2, 8</xref>
        ] is being used for many
data-driven web-based services (like YouTube, Facebook,
Amazon, etc.) and for specific expert-driven applications like museum
guides (e.g. the Rijksmuseum CHIP demonstrator [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]) and on-line
course texts created using e.g. Interbook [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], AHA! [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] or GALE [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ]. As the references show, several papers describing adaptive
systems have been published at the ACM Hypertext conference instead
of at UMAP. So-called “systems papers” have always been
somewhat problematic to publish because the typical empirical
evaluation with groups of end-users makes no sense for them. Over the
years the UMAP research community seems
to have developed a strong preference for research on methods and
applications and has difficulty handling papers that merely present a
new “platform”.
      </p>
      <p>A second issue we address in this paper is that when considering
a specific application, for instance an on-line course text, it is
unclear how the benefit of adaptation can be evaluated, as it is hard to
do a fair comparison between applications.</p>
    </sec>
    <sec id="sec-2">
      <title>EVALUATING ADAPTIVE SYSTEMS</title>
      <p>
        When adaptive applications were introduced, mainly in the early
1990s, they were closely integrated with the technology used to
run the application. A good example is ELM-ART [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], an on-line
Lisp tutor which was called “an Intelligent Tutoring System” even
though it was really an Intelligent Tutoring Application. ELM-ART
inspired new developments, but later systems like Interbook, AHA!
and its successor GALE were all realized as “platforms” in which
an author can create adaptive on-line courses. The distinction
between the application, e.g. an on-line course, and the system that
makes it possible to deliver the application is essential: end-user
evaluation of the system makes no sense, whereas end-user
evaluation of the application may make sense but is a very different matter.
      </p>
      <p>When we first presented AHA! we were often asked “How good
is the adaptation provided by AHA!?” Our standard answer,
“The quality of the adaptation depends entirely on what the author
of an application supported by AHA! creates,” was never
considered satisfactory. Yet it was, and still is, the only possible
answer.</p>
      <p>For the UMAP community it is of vital importance that generic
or general purpose adaptive systems are developed so that researchers
who wish to experiment with adaptation and with applications such
as on-line courses can concentrate on the core of their research
without the need to also develop underlying technology that makes
the adaptation they need possible. At the same time, the community
is reluctant to accept papers describing new platform
developments, because such papers cannot contain an end-user comparative
evaluation without also considering an application running on the
platform, and thus actually evaluating the application, not the system.</p>
      <p>Interbook could be used to develop and deliver on-line courses
about very different topics, but all presented in the same
style and using the same adaptation strategies. AHA! and
later GALE went further: they allow the definition of arbitrary
rules for user modeling and adaptation, allow for the conditional
inclusion of fragments and objects in the presentation, and allow
for the use of arbitrary presentation styles, arbitrary layout,
arbitrary choice of link annotation, etc. As a result, the question “How
good is the adaptation provided by AHA! or GALE?” is nonsense.
The systems provide the adaptation an author defines. AHA! and
GALE applications can provide excellent adaptation that greatly
helps learners find their way through an on-line course and AHA!
and GALE applications can also completely mislead learners and
make learning much harder than if there were no adaptation at all.</p>
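      <p>The point can be made concrete with a minimal sketch in Python. It is not the AHA! or GALE rule language: the names Platform, careful_author and careless_author, the toy prerequisite table and the 0.5 knowledge threshold are all assumptions made for illustration. The same engine executes two author-defined rules and produces helpful link annotation with one and misleading annotation with the other.</p>
      <preformat>
```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical names throughout: this is NOT the AHA! or GALE rule language,
# only an illustration of a platform that executes author-defined rules.

UserModel = Dict[str, float]             # concept name -> knowledge level in [0, 1]
Rule = Callable[[UserModel, str], bool]  # (user model, link target) -> recommend?


@dataclass
class Platform:
    """A generic engine: it applies whatever rule the author supplies."""
    rule: Rule

    def annotate(self, user: UserModel, links: List[str]) -> Dict[str, str]:
        return {
            target: "recommended" if self.rule(user, target) else "not recommended"
            for target in links
        }


# Assumed prerequisite structure of a toy course.
PREREQUISITES = {"recursion": ["functions"], "functions": []}


def careful_author(user: UserModel, target: str) -> bool:
    # Recommend a page only when all its prerequisites are sufficiently known.
    return all(user.get(p, 0.0) >= 0.5 for p in PREREQUISITES[target])


def careless_author(user: UserModel, target: str) -> bool:
    # Inverted logic: recommends exactly the pages the learner is not ready for.
    return not careful_author(user, target)


novice: UserModel = {"functions": 0.0}
links = ["functions", "recursion"]

good = Platform(careful_author).annotate(novice, links)
bad = Platform(careless_author).annotate(novice, links)
print(good)
print(bad)
```
      </preformat>
      <p>For the novice in this sketch, the careful rule recommends only “functions”, whereas the careless rule recommends only “recursion”. The engine is identical in both cases, so any end-user measurement would reflect the author's rule and not the platform.</p>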
      <p>
        There are definitely ways to evaluate adaptive systems: by
mapping the functionality of the system to existing reference
models like AHAM [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and GAF [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], the user modeling and adaptation
power of different systems can be compared. And by taking
performance measurements under (synthetic or real-world) load, the
ability of systems to handle large numbers of users can be
compared. Neither type of comparison has gained acceptance
as “evaluation” in the UMAP community.
      </p>
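      <p>A comparison of the first kind could be as simple as the following Python sketch. The checklist entries, the two systems and their capabilities are all hypothetical. A real checklist would have to be derived from a reference model such as AHAM or GAF, and the sketch makes no claim about any existing platform.</p>
      <preformat>
```python
from typing import Dict, Set

# Hypothetical checklist; a real one would be derived from a reference model.
CHECKLIST: Set[str] = {
    "author-defined adaptation rules",
    "conditional inclusion of fragments",
    "link annotation",
    "custom presentation styles",
}

# Hypothetical systems, not assessments of any real platform.
SYSTEMS: Dict[str, Set[str]] = {
    "System A": {"link annotation"},
    "System B": {
        "author-defined adaptation rules",
        "conditional inclusion of fragments",
        "link annotation",
    },
}


def coverage(supported: Set[str]) -> float:
    """Fraction of checklist capabilities that a system supports."""
    return len(supported.intersection(CHECKLIST)) / len(CHECKLIST)


def missing(supported: Set[str]) -> Set[str]:
    """Checklist capabilities that a system lacks."""
    return CHECKLIST - supported


report = {name: coverage(caps) for name, caps in SYSTEMS.items()}
print(report)
print(missing(SYSTEMS["System B"]))
```
      </preformat>
      <p>Such a report compares what systems can express, independently of any application built on top of them.</p>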
    </sec>
    <sec id="sec-3">
      <title>EVALUATING ADAPTIVE APPLICATIONS</title>
      <p>
        The core question about adaptive applications is “Does
adaptation help?” This question was, for instance, addressed in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ],
considering adaptive link annotation in an Interbook application. What
typically happens in such an evaluation is that a user group is
divided into two subgroups: one subgroup works with the
adaptive application and the other works with the same
application with the adaptive functionality turned off. A number of
performance indicators are then measured, such as how many
navigation steps users make and how well they perform on a test. The
results are somewhat predictable: the users of the adaptive
application perform better and are more satisfied than the users of the
“crippled” application that had no adaptation. The comparison is unfair
because an author who creates an on-line (hypermedia) course text and cannot use
any adaptation will put a lot of care into deciding where to place
which links. Users who study a course page may have reached that
page through many different paths, so they will have different
knowledge and knowledge gaps. When the page can link to
another (related) topic, the author needs to decide carefully whether
to make that link available or not. In an adaptive application the
author can simply place the link, and the system will decide, based on the
user’s knowledge and on prerequisite relationships, whether that
user will be advised to follow that link at that moment in
time. So it is likely that the adaptive course text will contain many links that
will sometimes be recommended and sometimes not. Simply
making these links recommended all the time does not give the non-adaptive
application a fair chance in any comparison with the adaptive application.
      </p>
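      <p>The fairness problem can be illustrated with a small Python sketch under stated assumptions. The three-page course, its prerequisite table and all function names are hypothetical, and the rule “recommend a page that is not yet known when all its prerequisites are known” is only a simplification of what an author might define. The sketch contrasts adaptive recommendation with the condition in which adaptation is turned off and every link is always recommended.</p>
      <preformat>
```python
from typing import Dict, List, Set

# Assumed toy course: page -> pages that should be studied first.
PREREQUISITES: Dict[str, List[str]] = {
    "intro": [],
    "lists": ["intro"],
    "recursion": ["intro", "lists"],
}


def adaptive_recommendations(known: Set[str]) -> Set[str]:
    """Recommend not-yet-known pages whose prerequisites are all known."""
    return {
        page
        for page, prereqs in PREREQUISITES.items()
        if page not in known and all(p in known for p in prereqs)
    }


def switched_off_recommendations(known: Set[str]) -> Set[str]:
    """Adaptation turned off: every not-yet-known page is recommended."""
    return {page for page in PREREQUISITES if page not in known}


def premature(recommended: Set[str], known: Set[str]) -> Set[str]:
    """Recommended pages for which some prerequisite is still unknown."""
    return {
        page
        for page in recommended
        if any(p not in known for p in PREREQUISITES[page])
    }


known = {"intro"}
adaptive = adaptive_recommendations(known)
baseline = switched_off_recommendations(known)
print(sorted(adaptive), sorted(premature(adaptive, known)))
print(sorted(baseline), sorted(premature(baseline, known)))
```
      </preformat>
      <p>With only “intro” known, the adaptive version recommends “lists”, while the switched-off version also recommends “recursion” before its prerequisite “lists” has been studied. A carefully authored non-adaptive course would probably not have offered that link at this point, which is why the switched-off version is a weak baseline.</p>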
      <p>
        Problems and pitfalls in the evaluation of adaptive applications
have already been identified, for instance by Weibelzahl [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Even
though the title of that publication mentions “adaptive systems” it is
really more about applications. In [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] it is argued that separating the
evaluation of different aspects (rather than brute force enabling or
disabling all adaptation) can help to pinpoint where the adaptation
helps or fails. The paper [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] provides a detailed description of a
layered approach to the evaluation that makes it clear that proper
evaluation of an adaptive application is a huge task, not something
to describe in a short section at the end of a research paper. It
should not come as a surprise that many evaluations
of adaptive applications have not been performed to such
a rigorous standard and are actually closer to nonsense.
      </p>
    </sec>
    <sec id="sec-4">
      <title>CONCLUSIONS AND DISCUSSION</title>
      <p>
        In this paper we have argued that the typical end-user evaluation
with groups of users using different versions of applications 1)
cannot be used at all to evaluate adaptive systems or platforms and 2)
can be used to evaluate applications, but only as a major undertaking (see [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ])
when one wants to avoid comparisons that make no sense.
      </p>
      <p>Two interesting discussion topics for the workshop are:</p>
      <p>Since we really need generic platforms that can be
used to perform UMAP research without the need for every
researcher to create their own special-purpose platform, we
need to discuss criteria for assessing whether a paper
describing the development of a generic system is acceptable. The
current practice is that UMAP researchers publish system
descriptions in other venues. An incentive should be created at
UMAP to embrace “systems papers”.</p>
      <p>
        In answering the “Does adaptation help?” question we should
have clearer criteria for the comparative evaluation to avoid
the pitfall of simply comparing an adaptive with a non-adaptive
version and hoping the results are not nonsense. We should
define some quality standard for the applications with which
we compare. The layered approach published in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], among
others, may help us get started with setting that standard.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Aroyo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Stash</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gorgels</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Rutledge</surname>
          </string-name>
          .
          <article-title>CHIP Demonstrator: Semantics-Driven Recommendations And Museum Tour Generation</article-title>
          . In G. Schreiber and K. Aberer, editors,
          <source>The Semantic Web - ISWC/ASWC</source>
          <year>2007</year>
          , volume
          <volume>4825</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>879</fpage>
          -
          <lpage>886</lpage>
          . Springer,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Brusilovsky</surname>
          </string-name>
          .
          <article-title>Adaptive hypermedia</article-title>
          .
          <source>User Modeling and User-Adapted Interaction</source>
          ,
          <volume>11</volume>
          (
          <issue>1-2</issue>
          ):
          <fpage>87</fpage>
          -
          <lpage>110</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Brusilovsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Eklund</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Schwarz</surname>
          </string-name>
          .
          <article-title>Web-based education for all: a tool for development adaptive courseware</article-title>
          .
          <source>Computer Networks and ISDN Systems</source>
          ,
          <volume>30</volume>
          (
          <issue>1-7</issue>
          ):
          <fpage>291</fpage>
          -
          <lpage>300</lpage>
          ,
          <year>1998</year>
          .
          <source>Proceedings of the Seventh International World Wide Web Conference.</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Brusilovsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Karagiannidis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Sampson</surname>
          </string-name>
          .
          <article-title>The benefits of layered evaluation of adaptive applications and services</article-title>
          . In In, pages
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Brusilovsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. W.</given-names>
            <surname>Schwarz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Weber</surname>
          </string-name>
          .
          <article-title>ELM-ART: An intelligent tutoring system on world wide web</article-title>
          .
          <source>In Proceedings of the Third International Conference on Intelligent Tutoring Systems, ITS '96</source>
          , pages
          <fpage>261</fpage>
          -
          <lpage>269</lpage>
          , London, UK,
          <year>1996</year>
          . Springer-Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>De Bra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.-J.</given-names>
            <surname>Houben</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Wu</surname>
          </string-name>
          .
          <article-title>AHAM: A Dexter-based reference model for adaptive hypermedia</article-title>
          .
          <source>In Proceedings of the Tenth ACM Conference on Hypertext and Hypermedia: Returning to Our Diverse Roots, HYPERTEXT '99</source>
          , pages
          <fpage>147</fpage>
          -
          <lpage>156</lpage>
          , New York, NY, USA,
          <year>1999</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P.</given-names>
            <surname>De Bra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Smits</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Stash</surname>
          </string-name>
          .
          <article-title>The design of AHA!</article-title>
          <source>In Proceedings of the seventeenth ACM conference on Hypertext</source>
          , page 133. ACM,
          <year>2006</year>
          , adaptive version at http://aha.win.tue.nl/ahadesign/.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>E.</given-names>
            <surname>Knutov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>De Bra</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Pechenizkiy</surname>
          </string-name>
          .
          <article-title>AH 12 years later: a comprehensive survey of adaptive hypermedia methods and techniques</article-title>
          .
          <source>New Review of Hypermedia and Multimedia</source>
          ,
          <volume>15</volume>
          (
          <issue>1</issue>
          ):
          <fpage>5</fpage>
          -
          <lpage>38</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Paramythis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Weibelzahl</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Masthoff</surname>
          </string-name>
          .
          <article-title>Layered evaluation of interactive adaptive systems: framework and formative methods</article-title>
          .
          <source>User Modeling and User-Adapted Interaction</source>
          ,
          <volume>20</volume>
          (
          <issue>5</issue>
          ):
          <fpage>383</fpage>
          -
          <lpage>453</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D.</given-names>
            <surname>Smits</surname>
          </string-name>
          .
          <article-title>Towards a Generic Distributed Adaptive Hypermedia Environment</article-title>
          .
          <source>PhD thesis</source>
          , Eindhoven University of Technology, adaptive version on http://gale.win.tue.nl/thesis/,
          ISBN 978-90-386-3115-8
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D.</given-names>
            <surname>Smits</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>De Bra</surname>
          </string-name>
          .
          <article-title>GALE: a highly extensible adaptive hypermedia engine</article-title>
          .
          <source>In Proceedings of the twentysecond ACM conference on Hypertext</source>
          , pages
          <fpage>63</fpage>
          -
          <lpage>72</lpage>
          . ACM,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Weibelzahl</surname>
          </string-name>
          .
          <article-title>Problems and pitfalls in the evaluation of adaptive systems</article-title>
          .
          <source>Adaptable and adaptive hypermedia systems</source>
          ,
          <volume>11</volume>
          :
          <fpage>285</fpage>
          -
          <lpage>299</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>