<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Industrial-Strength Usability Evaluation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Martin Schmettow</string-name>
          <email>schmettow@web.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Passau University Informations Systems II 94032 Passau</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Usability professionals may face strict economic demands on the usability process in the near future. This position paper outlines a research agenda to make usability evaluation a predictable and highly efficient engineering process.</p>
      </abstract>
      <kwd-group>
        <kwd>Usability Evaluation</kwd>
        <kwd>Measurement</kwd>
        <kwd>Process Quality</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. MOTIVATION</title>
      <p>Usability professionals never tire of stressing the economic impact of good usability. And indeed, there are several compelling arguments. The first can be derived from the ISO 9241-11 standard: efficiency is regarded as one of the three main criteria of usability and can be converted directly into economic gain. For example, a very efficient interface to an enterprise information system lets users complete their tasks more quickly, which increases overall throughput. The second argument is specific to web usability. Web users are known to be very impatient with web sites that have poor usability, especially when purchasing online; consequently, usability directly affects the conversion rate of e-commerce companies. The third argument comes from the perspective of software development. It is widely accepted that the cost of fixing a defect increases more than linearly with the time between its introduction and its discovery. This justifies intensive usability evaluation early in system development.</p>
      <p>But many usability professionals still act under the paradigm of
discount usability. In a broad sense, this denotes usability
evaluation as a best-effort strategy, conducted iteratively by
experts who just know what they are doing. What if clients or
employers of usability professionals start taking the above
economic arguments seriously? For example, what if a start-up
company has an innovative product idea and plenty of venture
capital, but usability is mission-critical and they have only one
shot? Will they rely on discount usability? Will they accept the
good reputation of a usability company as the only guarantee? It
is more likely that they will want objective preconditions, such as a
proven and certified evaluation plan. And maybe they will even want
quantitative guarantees and proof of contract fulfillment, such as:
there is no show stopper left in the system, and at least 90% of
serious problems have been identified. The paradigm of discount
usability is inappropriate in such cases.</p>
      <p>
        Research on the usability evaluation process has seen two major
debates (or research agendas): the Five-Users-Is-Not-Enough debate
and the Damaged Merchandise debate. The Five Users debate is about
how to reliably plan and control usability evaluation studies, whereas
the Damaged Merchandise debate treats the topic of how to compare
evaluation methods in a fair and valid way. In the following, I will
argue why we must continue these research agendas in order to make
usability evaluation a well-understood and highly optimized
engineering activity. But I will also claim that we have to take off
some blinders.
      </p>
    </sec>
    <sec>
      <title>2. WHY TO CONTINUE THE “FIVE USERS” DEBATE</title>
      <p>
        The Five Users debate goes back to Nielsen and Landauer's
suggestion to model the progress of evaluation studies as a
geometric series [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Unfortunately, the debate was driven primarily
by an oversimplification on the part of Nielsen, who trivialized his
own findings by stating that testing five users is enough in
industrial practice [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. This is, by the way, an excellent example
of the discount usability paradigm, which may turn out to be obsolete.
In contrast, several researchers went deeper into the theoretical
implications of this model: the phenomenon of variance in the process
was discovered [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], good task design was found to be a major
impact factor [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and basic stochastic assumptions of the model
were questioned [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. A recent contribution was the proof that the
geometric model is inherently flawed by falsely assuming that
usability defects are equally visible and sessions equally effective
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Instead, the beta-geometric model, accounting for
heterogeneity, was shown to better predict the process.
      </p>
      <p>
        But this is still an oversimplification that does not comprise all
impact factors found in industrial studies. For example, I recently
tried to fit the data reported from the CUE-4 study with the
beta-geometric model, with disappointing results: the model could
not sufficiently explain the overwhelming number of defects that
were detected only once [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In consequence, there is still no
reliable estimate of how many defects were left undetected. As a
first step, there are two options for enhancing the model in order to
better fit the data and to reliably plan and control usability studies.
First, the study progress has to be tracked at the finer-grained
level of single tasks presented in a usability test (or imagined by
usability inspectors). Specifically, this may help to identify when a
certain set of tasks is “exhausted” and to replace it with new tasks
that make further defects observable. Second, the current models do
not handle the problem of false alarms in evaluation studies.
These may well be responsible for the misfit reported above. Currently,
we are working on an enhanced model that incorporates the
occurrence of false alarms and varying task sets. This will hopefully
enable us to better estimate the number of remaining defects
(misses) and to give a probability for a reported defect being a
false alarm. The latter may prevent wasting development
resources on would-be defects and thus has direct economic
impact.
      </p>
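      <p>As a rough illustration of the geometric series model discussed above, the following Python sketch computes the expected share of defects found after n sessions as 1 − (1 − p)^n, where p is a detection probability assumed to be the same for every defect and every session. The function names, the value p = 0.31 and the three per-defect probabilities are assumptions made for this example only. The last function is not the beta-geometric model; it merely shows what happens to the curve when defects differ in visibility while the mean detection probability stays the same.</p>
      <preformat>
```python
import math


def expected_discovery_rate(p, n):
    """Expected share of defects found after n sessions.

    Geometric series model: every defect is detected in every session
    with the same probability p, independently of all other sessions.
    """
    return 1.0 - (1.0 - p) ** n


def sessions_needed(p, target):
    """Smallest number of sessions whose expected discovery rate reaches target."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


def heterogeneous_discovery_rate(ps, n):
    """Expected share found when each defect has its own detection probability.

    ps is a list with one (hypothetical) detection probability per defect.
    """
    return sum(1.0 - (1.0 - p) ** n for p in ps) / len(ps)


if __name__ == "__main__":
    p = 0.31  # assumed detection probability, for illustration only
    print(expected_discovery_rate(p, 5))   # about 0.84
    print(sessions_needed(p, 0.90))        # 7
    # Three hypothetical defects whose probabilities also average 0.31
    print(heterogeneous_discovery_rate([0.05, 0.31, 0.57], 5))  # about 0.69
```
      </preformat>
      <p>Under these assumptions, five sessions are expected to reveal about 84% of the defects, and seven sessions are needed to pass 90%. For the three hypothetical defects with the same mean probability, the expected share after five sessions drops to about 69%, which illustrates why a model that ignores heterogeneity tends to be too optimistic.</p>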
    </sec>
    <sec id="sec-2">
      <title>3. BEYOND “CHASING THE HE”</title>
      <p>
        The Damaged Merchandise debate arose from the harsh critique by
Gray and Salzman of the poor validity of experiments on usability
evaluation methods (UEMs) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. However, my main point here is not validity, but the
observation that research on designing UEMs has not made much
substantial progress. Even recent well-designed studies are still
very restricted in their contribution to understanding the cognitive
or contextual factors of finding usability defects. Instead, they
make more or less marginal adaptations to common inspection
methods and compare the result to Heuristic Evaluation (HE) in a
two-condition experimental design. The observed
effectiveness gains are in many cases marginal (e.g. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]) or
nonexistent [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. This “Chasing the HE” approach has the severe
drawback of restricted insight. It only lets us know which of two
procedures is (slightly) better. It does not inform us about the
specific interplay of impact factors that makes defect
identification effective. But this is a precondition for designing
(much) better procedures, providing adequate training and adjusting
the evaluation process to business goals.
      </p>
      <p>
        Only a few studies have paid attention to successful versus
unsuccessful cognitive-behavioral strategies of usability experts.
To give an example of a rarely recognized work that has done
better: perspective-based reading is a well-known technique in
software inspections that raises effectiveness by reducing
cognitive load. Zhang et al. transferred this technique to
usability inspection and found similar improvements [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
Another positive example is how Woolrych et al. analyzed the
knowledge resources involved in usability inspections [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. (They
also made some points on how false alarms arise.)
      </p>
      <p>
        These are interesting and relevant results, as they may lead to
methods and training concepts for increased effectiveness of
usability experts. But there is still a lack of quantitative research
on such topics. In particular, defects are likely to have qualitative
properties that make a difference with respect to behavioral
strategies and knowledge resources. Frøkjaer and Hornbaek
found differing detection profiles for two inspection methods after
classifying defects with the User Action Framework [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Another
promising way to go is to search for defect classes in the raw data
from evaluation processes and derive an empirically valid
classification. Advanced statistical exploration techniques, like
differential item functioning from item response theory [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] or
binary cluster analysis, probably apply well to this problem, in
contrast to ordinary analysis of variance. The strength of these
techniques is that they do not require manipulating independent
variables. Instead, they can reveal latent variables in existing data
sets, including results from industrial studies.
      </p>
      <p>These approaches may be used to profile methods according to
their effectiveness regarding certain types of defects. In industrial
settings, this is useful for selecting a method appropriate to the
development context. For example, we may purposefully choose a
method for identifying task-related defects early in
development. Later in the development process, another method
may serve to identify superficial design issues. Another
possibility is aligning the evaluation focus with business goals, e.g.
evaluating for efficiency in case a system is primarily aimed at
experts.</p>
    </sec>
    <sec id="sec-3">
      <title>4. CONCLUSION</title>
      <p>Modern software engineering pays close attention to economic
demands: efficiency of development processes, early defect
discovery and alignment of software qualities with business goals. The
usability profession is still lagging a little behind, but may
someday face its customers’ demands for process approval,
efficiency and guarantees. The aim of this paper was to point out
valuable research agendas of the past, but also to identify future
directions of research: quantitative research with refined
experimental designs and advanced statistical techniques may
reveal relevant properties on several levels of the usability
evaluation process. Knowing the properties on the process level
results in better approaches to planning and controlling studies
towards given business goals. Knowing the properties on the
cognitive-behavioral level is a precondition for significantly raising
the effectiveness and appropriateness of evaluation processes. Much
can be achieved with advanced statistical techniques on existing
data sets. The minimum to be gained is specific and well-grounded
hypotheses that will inspire well-designed and elaborate
experimental studies to deeply understand the anatomy of
usability evaluation.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Alan</given-names>
            <surname>Woolrych</surname>
          </string-name>
          , Gilbert Cockton, and
          <string-name>
            <given-names>Mark</given-names>
            <surname>Hindmarch</surname>
          </string-name>
          .
          <article-title>Knowledge Resources in Usability Inspection</article-title>
          .
          <source>In Proceedings of the HCI 2005</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>David A.</given-names>
            <surname>Caulton</surname>
          </string-name>
          .
          <article-title>Relaxing the homogeneity assumption in usability testing</article-title>
          .
          <source>Behaviour &amp; Information Technology</source>
          ,
          <volume>20</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Laura</given-names>
            <surname>Faulkner</surname>
          </string-name>
          .
          <article-title>Beyond the five-user assumption: Benefits of increased sample sizes in usability testing</article-title>
          .
          <source>Behavior Research Methods, Instruments &amp; Computers</source>
          ,
          <volume>35</volume>
          (
          <issue>3</issue>
          ):
          <fpage>379</fpage>
          -
          <lpage>383</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Erik</given-names>
            <surname>Frøkjaer</surname>
          </string-name>
          and
          <string-name>
            <given-names>Kasper</given-names>
            <surname>Hornbaek</surname>
          </string-name>
          .
          <article-title>Metaphors of human thinking for usability inspection and design</article-title>
          .
          <source>ACM Trans. Comput.-Hum. Interact.</source>
          ,
          <volume>14</volume>
          (
          <issue>4</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>33</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Wayne D.</given-names>
            <surname>Gray</surname>
          </string-name>
          and
          <string-name>
            <given-names>Marilyn C.</given-names>
            <surname>Salzman</surname>
          </string-name>
          .
          <article-title>Damaged merchandise? A review of experiments that compare usability evaluation methods</article-title>
          .
          <source>Human-Computer Interaction</source>
          ,
          <volume>13</volume>
          (
          <issue>3</issue>
          ):
          <fpage>203</fpage>
          -
          <lpage>261</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Gitte</given-names>
            <surname>Lindgaard</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jarinee</given-names>
            <surname>Chattratichart</surname>
          </string-name>
          .
          <article-title>Usability testing: What have we overlooked?</article-title>
          <source>In CHI '07: Proceedings of the SIGCHI conference on Human factors in computing systems</source>
          , pages
          <fpage>1415</fpage>
          -
          <lpage>1424</lpage>
          , New York, NY, USA,
          <year>2007</year>
          . ACM Press.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Rolf</given-names>
            <surname>Molich</surname>
          </string-name>
          and
          <string-name>
            <given-names>Joseph S.</given-names>
            <surname>Dumas</surname>
          </string-name>
          .
          <article-title>Comparative usability evaluation (CUE-4)</article-title>
          .
          <source>Behaviour &amp; Information Technology</source>
          ,
          <volume>27</volume>
          (
          <issue>3</issue>
          ),
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Jakob</given-names>
            <surname>Nielsen</surname>
          </string-name>
          .
          <article-title>Why you only need to test with 5 users</article-title>
          .
          <source>Jakob Nielsen's Alertbox</source>
          , March 19,
          <year>2000</year>
          . http://www.useit.com/alertbox/20000319.html.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Jakob</given-names>
            <surname>Nielsen</surname>
          </string-name>
          and
          <string-name>
            <given-names>Thomas K.</given-names>
            <surname>Landauer</surname>
          </string-name>
          .
          <article-title>A mathematical model of the finding of usability problems</article-title>
          .
          <source>In CHI '93: Proceedings of the SIGCHI conference on Human factors in computing systems</source>
          , pages
          <fpage>206</fpage>
          -
          <lpage>213</lpage>
          , New York, NY, USA,
          <year>1993</year>
          . ACM Press.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Martin</given-names>
            <surname>Schmettow</surname>
          </string-name>
          .
          <article-title>Heterogeneity in the usability evaluation process</article-title>
          . In David England and Russell Beale, editors,
          <source>Proceedings of the HCI 2008</source>
          , volume
          <volume>1</volume>
          of People and Computers, pages
          <fpage>89</fpage>
          -
          <lpage>98</lpage>
          . British Computing Society,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Martin</given-names>
            <surname>Schmettow</surname>
          </string-name>
          and
          <string-name>
            <given-names>Sabine</given-names>
            <surname>Niebuhr</surname>
          </string-name>
          .
          <article-title>A pattern-based usability inspection method: First empirical performance measures and future issues</article-title>
          .
          In Devina Ramduny-Ellis and Dorothy Rachovides, editors,
          <source>Proceedings of the HCI 2007</source>
          , volume
          <volume>2</volume>
          of People and Computers, pages
          <fpage>99</fpage>
          -
          <lpage>102</lpage>
          . BCS,
          <year>September 2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Martin</given-names>
            <surname>Schmettow</surname>
          </string-name>
          and
          <string-name>
            <given-names>Wolfgang</given-names>
            <surname>Vietze</surname>
          </string-name>
          .
          <article-title>Introducing item response theory for measuring usability inspection processes</article-title>
          .
          <source>In CHI 2008 Proceedings</source>
          , pages
          <fpage>893</fpage>
          -
          <lpage>902</lpage>
          . ACM SIGCHI,
          <year>April 2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Zhijun</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Victor Basili, and
          <string-name>
            <given-names>Ben</given-names>
            <surname>Shneiderman</surname>
          </string-name>
          .
          <article-title>An empirical study of perspective based usability inspection</article-title>
          .
          <source>Technical report</source>
          , University of Maryland, Human-Computer Interaction Lab,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>