Towards Industrial-Strength Usability Evaluation

Martin Schmettow
Passau University, Information Systems II
94032 Passau, Germany
schmettow@web.de

ABSTRACT
Usability professionals may face strict economic demands on the usability process in the near future. This position paper outlines a research agenda to make usability evaluation a predictable and highly efficient engineering process.

Categories and Subject Descriptors
H.5.2 User Interfaces (e.g. HCI): Evaluation/methodology

Keywords
Usability Evaluation, Measurement, Process Quality

1. MOTIVATION
Usability professionals never tire of stressing the economic impact of good usability. And indeed, there are several compelling arguments. The first may be derived from the ISO norm 9241-11: efficiency is regarded as one of the three main criteria of usability and can be converted directly into economic benefit. For example, a very efficient interface to an enterprise information system lets users complete their tasks more quickly, which increases overall throughput. The second argument is specific to web usability: web users are known to be very impatient with web sites of poor usability, especially in online purchasing; consequently, usability directly affects the conversion rate of e-commerce companies. The third argument comes from the perspective of software development. It is a widely accepted law that the cost of fixing a defect grows over-linearly with how early the defect was introduced and how late it was found. This justifies intensive usability evaluation early in system development.

But many usability professionals still act under the paradigm of discount usability. In a broad sense this denotes usability evaluation as a best-effort strategy, conducted iteratively by experts who just know what they are doing. What if clients or employers of usability professionals start taking the above economic arguments seriously? For example: what if a start-up company has an innovative product idea and plenty of venture capital, but usability is mission-critical and they have only one shot? Will they rely on discount usability? Will they accept the good reputation of a usability company as the only guarantee? It is more likely that they will want objective preconditions, such as a proven and certified evaluation plan. And maybe they will even want quantitative guarantees and proven contract fulfillment, such as: there is no show stopper left in the system, and at least 90% of the serious problems have been identified. The paradigm of discount usability is inappropriate in such cases.

Research on the usability evaluation process has seen two major debates (research agendas, respectively): the Five-Users-Is-Not-Enough debate and the Damaged Merchandise debate. The Five Users debate is about how to reliably plan and control usability evaluation studies, whereas the Damaged Merchandise debate treats the topic of how to compare evaluation methods in a fair and valid way. In the following, I will argue why we must continue these research agendas in order to make usability evaluation a well understood and highly optimized engineering activity. But I will also claim that we have to take off some blinders.

2. WHY TO CONTINUE THE “FIVE USERS” DEBATE
The five users debate goes back to Nielsen and Landauer's suggestion to model the progress of evaluation studies as a geometric series [9]. Unfortunately, the debate was primarily carried by an oversimplification of Nielsen, who trivialized his own findings by stating that testing five users is enough in industrial practice [8]. This is, by the way, an excellent example of the discount usability paradigm, which may turn out to be obsolete. In contrast, several researchers went deeper into the theoretical implications of this model: the phenomenon of variance in the process was discovered [3], good task design was found to be a major impact factor [6], and basic stochastic assumptions of the model were questioned [2]. A recent contribution was the proof that the geometric model is inherently flawed by falsely assuming that usability defects are equally visible and sessions equally effective [10]. Instead, the beta-geometric model, accounting for heterogeneity, was shown to better predict the process.
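For reference, the two competing assumptions can be stated compactly. The notation here is mine and only a sketch, not a verbatim restatement of [9] or [10]: D denotes the total number of defects, p the detectability of a defect in a single session, and n the number of sessions. The geometric model assumes one p for all defects:

\[ \mathbb{E}[\text{found after } n \text{ sessions}] = D\,\bigl(1 - (1-p)^{n}\bigr) \]

With the often-cited average detectability of p \approx 0.31 from [8], five sessions give 1 - (1-0.31)^{5} \approx 0.84, which is the roughly 85% figure behind the “five users” recommendation. The beta-geometric model lets detectability vary across defects, p \sim \mathrm{Beta}(\alpha, \beta), so that

\[ \mathbb{E}[\text{found after } n \text{ sessions}] = D\left(1 - \frac{B(\alpha, \beta + n)}{B(\alpha, \beta)}\right) \]

with B the Beta function. By Jensen's inequality this curve saturates more slowly than the homogeneous curve with the same mean detectability, so heterogeneity predicts slower progress in later sessions than the geometric model suggests.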
But this is still an oversimplification that does not comprise all impact factors found in industrial studies. For example, I recently tried to fit the data reported from the CUE-4 study [7] with the beta-geometric model, and the results were disappointing: the model could not sufficiently explain the overwhelming number of defects that were detected only once. In consequence, there is still no reliable estimation of how many defects were left undetected. At present, there are two options for enhancing the model so that it better fits the data and supports reliable planning and control of usability studies. First, the study progress has to be tracked on the finer-grained level of single tasks presented in a usability test (or imagined by usability inspectors). Specifically, this may help identify when a certain set of tasks is “exhausted” and replace it by new tasks that make further defects observable. Second, the current models do not handle the problem of false alarms in evaluation studies. These may well be liable for the misfit reported above. Currently, we are working on an enhanced model that incorporates the occurrence of false alarms and varying task sets. This will hopefully enable us to better estimate the number of remaining defects (misses) and to give a probability for a reported defect being a false alarm. The latter may prevent wasting development resources on would-be defects and thus has direct economic impact.
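To illustrate the kind of estimate at stake, the following sketch fits a zero-truncated beta-binomial (a fixed-session-count analogue of the beta-geometric assumption) to detection counts and extrapolates how many defects were never seen at all. It is an illustration only, not the enhanced model mentioned above; the counts are hypothetical and numpy/scipy are assumed to be available.

import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln, betaln

# Detection counts: for every *observed* defect, in how many of the n
# sessions it was detected. Defects with zero detections are unobservable,
# hence the zero-truncated likelihood. The counts below are hypothetical.
n = 10
counts = np.array([1] * 14 + [2] * 7 + [3] * 4 + [5] * 3 + [8, 10])

def log_choose(m, k):
    return gammaln(m + 1) - gammaln(k + 1) - gammaln(m - k + 1)

def binom_logpmf(k, m, p):
    # homogeneous visibility: every defect equally detectable
    return log_choose(m, k) + k * np.log(p) + (m - k) * np.log1p(-p)

def betabinom_logpmf(k, m, a, b):
    # heterogeneous visibility: detectability varies as Beta(a, b)
    return log_choose(m, k) + betaln(k + a, m - k + b) - betaln(a, b)

def neg_loglik(params, model):
    logpmf = model(counts, n, *params)
    logp0 = model(0, n, *params)          # probability of never being detected
    return -np.sum(logpmf - np.log1p(-np.exp(logp0)))

fit_hom = minimize(neg_loglik, x0=[0.3], args=(binom_logpmf,),
                   bounds=[(1e-3, 1 - 1e-3)])
fit_het = minimize(neg_loglik, x0=[1.0, 3.0], args=(betabinom_logpmf,),
                   bounds=[(1e-3, None)] * 2)

# Extrapolate the never-seen defects: observed / (1 - P(zero detections))
p0 = np.exp(betabinom_logpmf(0, n, *fit_het.x))
total = len(counts) / (1 - p0)
print(f"neg. log-likelihood: homogeneous {fit_hom.fun:.1f}, "
      f"heterogeneous {fit_het.fun:.1f}")
print(f"estimated defects still undetected: {total - len(counts):.1f}")

A model that also covers false alarms would need an additional component for spurious reports; that is beyond this sketch.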
3. BEYOND “CHASING THE HE”
The Damaged Merchandise debate arose from the harsh critique of Gray and Salzman on the poor validity of experiments on usability evaluation methods (UEMs) [5]. However, my main point here is not validity, but the observation that research on designing UEMs has not made much substantial progress. Even recent, well-designed studies are still very restricted in their contribution to understanding the cognitive or contextual factors of finding usability defects. Instead, they make more or less marginal adaptations to common inspection methods and compare these in a two-condition experimental design to the Heuristic Evaluation (HE). The observed effectiveness gains are in many cases marginal (e.g. [4]) or non-existent [11]. This “Chasing the HE” approach has the severe drawback of restricted insight: it only tells us which of two procedures is (slightly) better. It does not inform us about the specific interplay of impact factors that grant effective defect identification. But this is a precondition for designing (much) better procedures, providing adequate training and adjusting the evaluation process to business goals.

Only a few studies have paid attention to successful versus unsuccessful cognitive-behavioral strategies of usability experts. To give an example of a rarely recognized work that has done better: perspective-based reading is a well-known technique in software inspections that raises effectiveness by reducing cognitive load. Zhang et al. have transferred this technique to usability inspection and found similar improvements [13]. Another positive example is how Woolrych et al. analyzed the knowledge resources involved in usability inspections [1]. (They also made some points on how false alarms arise.)

These are interesting and relevant results, as they may lead to methods and training concepts that increase the effectiveness of usability experts. But there is still a lack of quantitative research on such topics. In particular, defects are likely to have qualitative properties that make a difference with respect to behavioral strategies and knowledge resources. Frøkjær and Hornbæk have found differing detection profiles for two inspection methods after classifying defects with the User Action Framework [4]. Another promising way to go is to search for defect classes in the raw data from evaluation processes and derive an empirically valid classification. Advanced statistical exploration techniques, like differential item functioning from item response theory [12] or binary cluster analysis, probably apply well to this problem, in contrast to ordinary variance analysis. The strength of these techniques is that they do not require manipulating independent variables. Instead, they can reveal latent variables in existing data sets, including results from industrial studies.
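As one possible operationalisation of the binary cluster analysis mentioned above (not a method prescribed by [12] or any of the cited studies), defects can be grouped by their detection profiles across evaluators, for instance with Jaccard distances and agglomerative clustering. The detection matrix below is hypothetical and merely stands in for the raw data of an existing study.

import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical binary matrix: rows are defects, columns are evaluators,
# True means the evaluator reported the defect. In practice this would be
# taken from the raw data of an existing evaluation study.
rng = np.random.default_rng(7)
detections = rng.random((40, 12)) < 0.3
detections = detections[detections.sum(axis=1) > 0]   # never-seen defects are unobservable

dist = pdist(detections, metric="jaccard")             # dissimilarity of detection profiles
tree = linkage(dist, method="average")                 # agglomerative clustering
classes = fcluster(tree, t=3, criterion="maxclust")    # cut into e.g. three defect classes

for c in np.unique(classes):
    members = detections[classes == c]
    print(f"class {c}: {len(members)} defects, "
          f"mean detection rate {members.mean():.2f}")

Whether such clusters correspond to meaningful defect classes would of course have to be validated, for example against the User Action Framework [4].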
These statistical exploration approaches may be used to profile methods according to their effectiveness regarding certain types of defects. In industrial settings this is useful for selecting a method appropriate to the development context. For example, we may purposefully choose a method for identifying task-related defects early in development; later in the development process, another method may serve to identify superficial design issues. Another possibility is aligning the evaluation focus to business goals, e.g. evaluating for efficiency in case a system is primarily aimed at expert users.

4. CONCLUSION
Modern software engineering addresses economic demands well: efficiency of development processes, early defect discovery and aligning software qualities to business goals. The usability profession is still lagging a little behind, but may soon face its customers' demands for process approval, efficiency and guarantees. The aim of this paper was to point out valuable research agendas of the past, but also to identify future directions of research: quantitative research with refined experimental designs and advanced statistical techniques may reveal relevant properties on several levels of the usability evaluation process. Knowing the properties on the process level results in better approaches to plan and control studies towards given business goals. Knowing the properties on the cognitive-behavioral level is a precondition to significantly raising the effectiveness and appropriateness of evaluation processes. Much can be achieved with advanced statistical techniques on existing data sets. At the very least, this yields specific and well-grounded hypotheses that will inspire well-designed and elaborate experimental studies to deeply understand the anatomy of usability evaluation.

5. REFERENCES
[1] Alan Woolrych, Gilbert Cockton, and Mark Hindmarch. Knowledge resources in usability inspection. In Proceedings of the HCI 2005, 2005.
[2] David A. Caulton. Relaxing the homogeneity assumption in usability testing. Behaviour & Information Technology, 20(1):1–7, 2001.
[3] Laura Faulkner. Beyond the five-user assumption: Benefits of increased sample sizes in usability testing. Behavior Research Methods, Instruments & Computers, 35(3):379–383, 2003.
[4] Erik Frøkjær and Kasper Hornbæk. Metaphors of human thinking for usability inspection and design. ACM Trans. Comput.-Hum. Interact., 14(4):1–33, 2008.
[5] Wayne D. Gray and Marilyn C. Salzman. Damaged merchandise? A review of experiments that compare usability evaluation methods. Human-Computer Interaction, 13(3):203–261, 1998.
[6] Gitte Lindgaard and Jarinee Chattratichart. Usability testing: What have we overlooked? In CHI '07: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 1415–1424, New York, NY, USA, 2007. ACM Press.
[7] Rolf Molich and Joseph S. Dumas. Comparative usability evaluation (CUE-4). Behaviour & Information Technology, 27(3), 2008.
[8] Jakob Nielsen. Why you only need to test with 5 users. Jakob Nielsen's Alertbox, March 19, 2000. http://www.useit.com/alertbox/20000319.html.
[9] Jakob Nielsen and Thomas K. Landauer. A mathematical model of the finding of usability problems. In CHI '93: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 206–213, New York, NY, USA, 1993. ACM Press.
[10] Martin Schmettow. Heterogeneity in the usability evaluation process. In David England and Russell Beale, editors, Proceedings of the HCI 2008, volume 1 of People and Computers, pages 89–98. British Computing Society, 2008. In print.
[11] Martin Schmettow and Sabine Niebuhr. A pattern-based usability inspection method: First empirical performance measures and future issues. In Devina Ramduny-Ellis and Dorothy Rachovides, editors, Proceedings of the HCI 2007, volume 2 of People and Computers, pages 99–102. BCS, September 2007.
[12] Martin Schmettow and Wolfgang Vietze. Introducing item response theory for measuring usability inspection processes. In CHI 2008 Proceedings, pages 893–902. ACM SIGCHI, April 2008.
[13] Zhijun Zhang, Victor Basili, and Ben Shneiderman. An empirical study of perspective-based usability inspection. Technical report, University of Maryland, Human-Computer Interaction Lab, 1998.