Towards Industrial-Strength Usability Evaluation

Martin Schmettow
Passau University, Information Systems II
94032 Passau, Germany
schmettow@web.de

ABSTRACT
Usability professionals may face strict economic demands on the usability process in the near future. This position paper outlines a research agenda to make usability evaluation a predictable and highly efficient engineering process.

Categories and Subject Descriptors
H.5.2 User Interfaces (e.g. HCI): Evaluation/methodology

Keywords
Usability Evaluation, Measurement, Process Quality

1. MOTIVATION
Usability professionals never tire of stressing the economic impact of good usability. And indeed, there are several compelling arguments. The first may be derived from the ISO norm 9241-11: efficiency is regarded as one of the three main criteria of usability and can be converted directly into economic benefit. For example, a very efficient interface to an enterprise information system lets users complete their tasks more quickly, which increases overall throughput. The second argument is specific to web usability: web users are known to be very impatient with web sites of poor usability, especially in online purchasing; consequently, usability directly affects the conversion rate of e-commerce companies. The third argument comes from the perspective of software development. It is a widely accepted law that the cost of fixing a defect grows over-linearly with how early the defect was introduced and how late it was found. This justifies intensive usability evaluation early in system development.

But many usability professionals still act under the paradigm of discount usability. In a broad sense this denotes usability evaluation as a best-effort strategy, conducted iteratively by experts who just know what they are doing. What if clients or employers of usability professionals start taking the above economic arguments seriously? For example: what if a start-up company has an innovative product idea and plenty of venture capital, but usability is mission-critical and they have only one shot? Will they rely on discount usability? Will they accept the good reputation of a usability company as the only guarantee? It is more likely that they will want objective preconditions, such as a proven and certified evaluation plan. And maybe they will even want quantitative guarantees and proven contract fulfillment, such as: there is no show stopper left in the system, and at least 90% of the serious problems have been identified. The paradigm of discount usability is inappropriate in such cases.

Research on the usability evaluation process has seen two major debates (research agendas, respectively): the Five-Users-Is-Not-Enough debate and the Damaged Merchandise debate. The Five Users debate is about how to reliably plan and control usability evaluation studies, whereas the Damaged Merchandise debate treats the topic of how to compare evaluation methods in a fair and valid way. In the following, I will argue why we must continue these research agendas in order to make usability evaluation a well understood and highly optimized engineering activity. But I will also claim that we have to take off some blinders.

2. WHY TO CONTINUE THE “FIVE USERS” DEBATE
The five users debate goes back to Nielsen and Landauer's suggestion to model the progress of evaluation studies as a geometric series [9]. Unfortunately, the debate was primarily carried by an oversimplification of Nielsen, who trivialized his own findings by stating that testing five users is enough in industrial practice [8]. This is, by the way, an excellent example of the discount usability paradigm, which may turn out to be obsolete. In contrast, several researchers went deeper into the theoretical implications of this model: the phenomenon of variance in the process was discovered [3], good task design was found to be a major impact factor [6], and basic stochastic assumptions of the model were questioned [2]. A recent contribution was the proof that the geometric model is inherently flawed by falsely assuming that usability defects are equally visible and sessions equally effective [10]. Instead, the beta-geometric model, accounting for heterogeneity, was shown to better predict the process.
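For reference, the two competing assumptions can be stated compactly. The notation here is mine and only a sketch, not a verbatim restatement of [9] or [10]: D denotes the total number of defects, p the detectability of a defect in a single session, and n the number of sessions. The geometric model assumes one p for all defects:

\[ \mathbb{E}[\text{found after } n \text{ sessions}] = D\,\bigl(1 - (1-p)^{n}\bigr) \]

With the often-cited average detectability of p \approx 0.31 from [8], five sessions give 1 - (1-0.31)^{5} \approx 0.84, which is the roughly 85% figure behind the “five users” recommendation. The beta-geometric model lets detectability vary across defects, p \sim \mathrm{Beta}(\alpha, \beta), so that

\[ \mathbb{E}[\text{found after } n \text{ sessions}] = D\left(1 - \frac{B(\alpha, \beta + n)}{B(\alpha, \beta)}\right) \]

with B the Beta function. By Jensen's inequality this curve saturates more slowly than the homogeneous curve with the same mean detectability, so heterogeneity predicts slower progress in later sessions than the geometric model suggests.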
But this is still an oversimplification that does not comprise all impact factors found in industrial studies. For example, I recently tried to fit the data reported from the CUE-4 study [7] with the beta-geometric model, and the results were disappointing: the model could not sufficiently explain the overwhelming number of defects that were detected only once. In consequence, there is still no reliable estimation of how many defects were left undetected. At present, there are two options for enhancing the model so that it better fits the data and supports reliable planning and control of usability studies. First, the study progress has to be tracked on the finer-grained level of single tasks presented in a usability test (or imagined by usability inspectors). Specifically, this may help identify when a certain set of tasks is “exhausted” and replace it by new tasks that make further defects observable. Second, the current models do not handle the problem of false alarms in evaluation studies. These may well be liable for the misfit reported above. Currently, we are working on an enhanced model that incorporates the occurrence of false alarms and varying task sets. This will hopefully enable us to better estimate the number of remaining defects (misses) and to give a probability for a reported defect being a false alarm. The latter may prevent wasting development resources on would-be defects and thus has direct economic impact.
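To illustrate the kind of estimate at stake, the following sketch fits a zero-truncated beta-binomial (a fixed-session-count analogue of the beta-geometric assumption) to detection counts and extrapolates how many defects were never seen at all. It is an illustration only, not the enhanced model mentioned above; the counts are hypothetical and numpy/scipy are assumed to be available.

import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln, betaln

# Detection counts: for every *observed* defect, in how many of the n
# sessions it was detected. Defects with zero detections are unobservable,
# hence the zero-truncated likelihood. The counts below are hypothetical.
n = 10
counts = np.array([1] * 14 + [2] * 7 + [3] * 4 + [5] * 3 + [8, 10])

def log_choose(m, k):
    return gammaln(m + 1) - gammaln(k + 1) - gammaln(m - k + 1)

def binom_logpmf(k, m, p):
    # homogeneous visibility: every defect equally detectable
    return log_choose(m, k) + k * np.log(p) + (m - k) * np.log1p(-p)

def betabinom_logpmf(k, m, a, b):
    # heterogeneous visibility: detectability varies as Beta(a, b)
    return log_choose(m, k) + betaln(k + a, m - k + b) - betaln(a, b)

def neg_loglik(params, model):
    logpmf = model(counts, n, *params)
    logp0 = model(0, n, *params)          # probability of never being detected
    return -np.sum(logpmf - np.log1p(-np.exp(logp0)))

fit_hom = minimize(neg_loglik, x0=[0.3], args=(binom_logpmf,),
                   bounds=[(1e-3, 1 - 1e-3)])
fit_het = minimize(neg_loglik, x0=[1.0, 3.0], args=(betabinom_logpmf,),
                   bounds=[(1e-3, None)] * 2)

# Extrapolate the never-seen defects: observed / (1 - P(zero detections))
p0 = np.exp(betabinom_logpmf(0, n, *fit_het.x))
total = len(counts) / (1 - p0)
print(f"neg. log-likelihood: homogeneous {fit_hom.fun:.1f}, "
      f"heterogeneous {fit_het.fun:.1f}")
print(f"estimated defects still undetected: {total - len(counts):.1f}")

A model that also covers false alarms would need an additional component for spurious reports; that is beyond this sketch.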
3. BEYOND “CHASING THE HE”
The Damaged Merchandise debate arose from the harsh critique of Gray and Salzman on the poor validity of experiments on usability evaluation methods (UEMs) [5]. However, my main point here is not validity, but the observation that research on designing UEMs has not made much substantial progress. Even recent, well-designed studies are still very restricted in their contribution to understanding the cognitive or contextual factors of finding usability defects. Instead, they make more or less marginal adaptations to common inspection methods and compare these in a two-condition experimental design to the Heuristic Evaluation (HE). The observed effectiveness gains are in many cases marginal (e.g. [4]) or non-existent [11]. This “Chasing the HE” approach has the severe drawback of restricted insight: it only tells us which of two procedures is (slightly) better. It does not inform us about the specific interplay of impact factors that grant effective defect identification. But this is a precondition for designing (much) better procedures, providing adequate training and adjusting the evaluation process to business goals.

Only a few studies have paid attention to successful versus unsuccessful cognitive-behavioral strategies of usability experts. To give an example of a rarely recognized work that has done better: perspective-based reading is a well-known technique in software inspections that raises effectiveness by reducing cognitive load. Zhang et al. have transferred this technique to usability inspection and found similar improvements [13]. Another positive example is how Woolrych et al. analyzed the knowledge resources involved in usability inspections [1]. (They also made some points on how false alarms arise.)

These are interesting and relevant results, as they may lead to methods and training concepts that increase the effectiveness of usability experts. But there is still a lack of quantitative research on such topics. In particular, defects are likely to have qualitative properties that make a difference with respect to behavioral strategies and knowledge resources. Frøkjær and Hornbæk have found differing detection profiles for two inspection methods after classifying defects with the User Action Framework [4]. Another promising way to go is to search for defect classes in the raw data from evaluation processes and derive an empirically valid classification. Advanced statistical exploration techniques, like differential item functioning from item response theory [12] or binary cluster analysis, probably apply well to this problem, in contrast to ordinary variance analysis. The strength of these techniques is that they do not require manipulating independent variables. Instead, they can reveal latent variables in existing data sets, including results from industrial studies.
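As one possible operationalisation of the binary cluster analysis mentioned above (not a method prescribed by [12] or any of the cited studies), defects can be grouped by their detection profiles across evaluators, for instance with Jaccard distances and agglomerative clustering. The detection matrix below is hypothetical and merely stands in for the raw data of an existing study.

import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical binary matrix: rows are defects, columns are evaluators,
# True means the evaluator reported the defect. In practice this would be
# taken from the raw data of an existing evaluation study.
rng = np.random.default_rng(7)
detections = rng.random((40, 12)) < 0.3
detections = detections[detections.sum(axis=1) > 0]   # never-seen defects are unobservable

dist = pdist(detections, metric="jaccard")             # dissimilarity of detection profiles
tree = linkage(dist, method="average")                 # agglomerative clustering
classes = fcluster(tree, t=3, criterion="maxclust")    # cut into e.g. three defect classes

for c in np.unique(classes):
    members = detections[classes == c]
    print(f"class {c}: {len(members)} defects, "
          f"mean detection rate {members.mean():.2f}")

Whether such clusters correspond to meaningful defect classes would of course have to be validated, for example against the User Action Framework [4].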
These statistical exploration approaches may be used to profile methods according to their effectiveness regarding certain types of defects. In industrial settings this is useful for selecting a method appropriate to the development context. For example, we may purposefully choose a method for identifying task-related defects early in development; later in the development process, another method may serve to identify superficial design issues. Another possibility is aligning the evaluation focus to business goals, e.g. evaluating for efficiency in case a system is primarily aimed at expert users.

4. CONCLUSION
Modern software engineering addresses economic demands well: efficiency of development processes, early defect discovery and aligning software qualities to business goals. The usability profession is still lagging a little behind, but may soon face its customers' demands for process approval, efficiency and guarantees. The aim of this paper was to point out valuable research agendas of the past, but also to identify future directions of research: quantitative research with refined experimental designs and advanced statistical techniques may reveal relevant properties on several levels of the usability evaluation process. Knowing the properties on the process level results in better approaches to plan and control studies towards given business goals. Knowing the properties on the cognitive-behavioral level is a precondition to significantly raising the effectiveness and appropriateness of evaluation processes. Much can be achieved with advanced statistical techniques on existing data sets. At the very least, this yields specific and well-grounded hypotheses that will inspire well-designed and elaborate experimental studies to deeply understand the anatomy of usability evaluation.

5. REFERENCES
[1] Alan Woolrych, Gilbert Cockton, and Mark Hindmarch. Knowledge resources in usability inspection. In Proceedings of the HCI 2005, 2005.
[2] David A. Caulton. Relaxing the homogeneity assumption in usability testing. Behaviour & Information Technology, 20(1):1–7, 2001.
[3] Laura Faulkner. Beyond the five-user assumption: Benefits of increased sample sizes in usability testing. Behavior Research Methods, Instruments & Computers, 35(3):379–383, 2003.
[4] Erik Frøkjær and Kasper Hornbæk. Metaphors of human thinking for usability inspection and design. ACM Trans. Comput.-Hum. Interact., 14(4):1–33, 2008.
[5] Wayne D. Gray and Marilyn C. Salzman. Damaged merchandise? A review of experiments that compare usability evaluation methods. Human-Computer Interaction, 13(3):203–261, 1998.
[6] Gitte Lindgaard and Jarinee Chattratichart. Usability testing: What have we overlooked? In CHI '07: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 1415–1424, New York, NY, USA, 2007. ACM Press.
[7] Rolf Molich and Joseph S. Dumas. Comparative usability evaluation (CUE-4). Behaviour & Information Technology, 27(3), 2008.
[8] Jakob Nielsen. Why you only need to test with 5 users. Jakob Nielsen's Alertbox, March 19, 2000. http://www.useit.com/alertbox/20000319.html.
[9] Jakob Nielsen and Thomas K. Landauer. A mathematical model of the finding of usability problems. In CHI '93: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 206–213, New York, NY, USA, 1993. ACM Press.
[10] Martin Schmettow. Heterogeneity in the usability evaluation process. In David England and Russell Beale, editors, Proceedings of the HCI 2008, volume 1 of People and Computers, pages 89–98. British Computing Society, 2008. In print.
[11] Martin Schmettow and Sabine Niebuhr. A pattern-based usability inspection method: First empirical performance measures and future issues. In Devina Ramduny-Ellis and Dorothy Rachovides, editors, Proceedings of the HCI 2007, volume 2 of People and Computers, pages 99–102. BCS, September 2007.
[12] Martin Schmettow and Wolfgang Vietze. Introducing item response theory for measuring usability inspection processes. In CHI 2008 Proceedings, pages 893–902. ACM SIGCHI, April 2008.
[13] Zhijun Zhang, Victor Basili, and Ben Shneiderman. An empirical study of perspective-based usability inspection. Technical report, University of Maryland, Human-Computer Interaction Lab, 1998.