=Paper=
{{Paper
|id=Vol-407/paper-8
|storemode=property
|title=Problems of Consolidating Usability Problems
|pdfUrl=https://ceur-ws.org/Vol-407/paper8.pdf
|volume=Vol-407
|dblpUrl=https://dblp.org/rec/conf/iused/LawH08
}}
==Problems of Consolidating Usability Problems==
Effie Lai-Chong Law
University of Leicester / ETH Zürich
LE1 7RH Leicester, UK / Institut TIK, Zürich, Switzerland
+44 116 2717302
law@tik.ee.ethz.ch

Ebba Thora Hvannberg
University of Iceland
107 Reykjavik, Iceland
+354 525 4702
ebba@hi.is
ABSTRACT
The process of consolidating usability problems (UPs) is an integral part of usability evaluation involving multiple users/analysts. However, little is known about the mechanism of this process and its effects on evaluation outcomes, which presumably influence how developers redesign the system of interest. We conducted an exploratory study with ten novice evaluators to examine how they performed when merging UPs in individual and collaborative settings and how they reached consensus. Our findings indicate that collaborative merging causes the absolute number of UPs to deflate and, concomitantly, the frequency of certain UP types as well as their severity ratings to inflate excessively. This can be attributed to the susceptibility of novice evaluators to persuasion in a negotiation setting, which led them to aggregate UPs leniently. Such distorted UP attributes may mislead the prioritization of UPs for fixing and thus result in ineffective system redesign.

Categories and Subject Descriptors
H.5.2 [User Interfaces]: Evaluation/Methodology

General Terms
Measurement, Performance, Experimentation, Theory

Keywords
Usability problems, Merging, Filtering, Consensus building, Downstream utility, Severity, Confidence, Evaluator effect

1. INTRODUCTION
The extent to which UPs identified by different users/analysts overlap seems unpredictable, despite persistent research efforts to formalize the cumulative relation between the number of users/analysts and the number of UPs ([7], [8], [10]). The practical implication of these concerns is to recruit as many users/analysts as the project's resources allow, thereby maximizing the probability of identifying most, though never all, UPs.

One concomitant procedure of involving multiple users/analysts in usability evaluation is to consolidate the UPs identified by different users/analysts into a master list. Such a consolidation process serves two purposes: (i) providing the design team with neat and clean information to facilitate system redesign, and (ii) enhancing the validity of comparisons of the effectiveness of different (instances of) usability evaluation methods (UEMs). The process consists of two phases [1]. The first, known as filtering, eliminates duplicates within a list of UPs identified by a user when performing a certain task with the system under scrutiny, or by an analyst when inspecting it. The second, merging, combines UPs across the lists produced by multiple users/analysts, retaining unique, relevant ones and discarding unique, irrelevant ones. While such consolidation procedures are commonly practised by usability professionals and researchers, little is known about how they are actually carried out and what impact they can have on final evaluation outcomes and, eventually, on system redesigns, especially when severity ratings play a non-trivial role in the prioritization strategy for UP fixing ([2], [3]).
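To make the two phases concrete, the sketch below (our illustration, not part of the paper) expresses filtering and merging over plain Python lists. The predicates same_problem and is_relevant are hypothetical stand-ins for the human judgements of UP similarity and relevance; as the study shows, it is precisely these judgements that vary across individual and collaborative settings.

```python
# Sketch of the two-phase UP consolidation described above (our illustration).
# same_problem and is_relevant are hypothetical stand-ins for the human
# judgements of whether two descriptions denote the same UP and whether a
# UP is relevant.

def filter_duplicates(problems, same_problem):
    """Phase 1 (filtering): eliminate duplicates within one user's/analyst's list."""
    kept = []
    for p in problems:
        if not any(same_problem(p, q) for q in kept):
            kept.append(p)
    return kept

def merge_lists(lists, same_problem, is_relevant):
    """Phase 2 (merging): combine UPs across lists into a master list."""
    master = []
    for problems in lists:
        for p in filter_duplicates(problems, same_problem):
            if any(same_problem(p, q) for q in master):
                continue  # duplicate across lists: absorbed into the master entry
            if is_relevant(p):
                master.append(p)  # unique and relevant: retain
            # unique but irrelevant: discard
    return master
```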
In the HCI literature, the UP consolidation procedure is mostly described at a coarse-grained level. Nielsen [9], when addressing the issue of multiple users/analysts, highlighted the significance of merging different UP lists but did not specify how this should be done. Connell and Hammond [1], in comparing the effectiveness of different UEMs, delineated the merging procedure at a rather abstract level. Further, Hertzum and Jacobsen [4] coined the notion of the evaluator effect, which has drawn much attention from the HCI community to the reliability and validity issues of usability evaluation. Nonetheless, their work focused on problem extraction on an individual basis rather than problem merging on a collaborative basis. More recently, a tool for merging and grouping UPs has been developed [5]; it supports, however, the work of individual evaluators and neglects the collaborative aspect of usability evaluation.

In summary, the actual practice of UP consolidation is largely open, unstructured and unchecked. With the major goals of examining the impact of the UP consolidation process and of understanding the mechanism underlying the consensus-building process, we conducted a research study. In this paper we summarize the main findings on the first issue, leaving out the second, as those data are still being analyzed.
2. RESEARCH METHODS
The empirical study was conducted at a university in the UK. Ten students (one female) majoring in computer science were recruited. All had acquired reasonable knowledge of HCI and experience in user-based evaluation through lectures and projects. They were grouped into five pairs. An e-learning platform had been usability-evaluated (i.e. with think-aloud sessions) with representative end-users one year earlier. Among the different types of data collected, we employed for the current study the observational reports written by the experimenter, who was present throughout the testing sessions and registered the users' behaviours in very fine detail. We also developed several structured forms to register the participants' findings in the different steps of our study. All participants attended two testing sessions: in the first they performed Individual Problem Extraction and Individual Problem Consolidation, and about a week later they paired up to perform Collaborative Problem Consolidation.

2.1 Individual Problem Extraction
Each participant was given the narrative observational reports (printed texts) describing how the users P1 and P2 performed Task 1 (T1), "Browse the Catalogue", and Task 2 (T2), "Provide and Offer a Learning Resource". For each UP extracted, the participant was required to record five attributes in a structured analysis form (see the sketch below):
1. a UP identifier in a given format;
2. a UP description, as detailed as possible;
3. criteria selected from a given list to justify the UP;
4. the severity level of the UP: minor, moderate or severe;
5. how confident the evaluator was that the identified UP was real: 1 (lowest) to 5 (highest).
After completing the analysis form for T1, the participant was asked to apply the same procedure to P1's T2, and then to P2's T1 and T2 (Figure 1). In other words, each participant was required to analyse four sets of data (P1-T1, P1-T2, P2-T1 and P2-T2).
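For illustration, the five attributes map naturally onto a small record type; the names below are our reconstruction, not the authors' actual form.

```python
from dataclasses import dataclass, field

# Hypothetical reconstruction of the structured analysis form's five attributes.
@dataclass
class UsabilityProblem:
    identifier: str                # 1. UP identifier in the given format
    description: str               # 2. UP description, as detailed as possible
    criteria: list[str] = field(default_factory=list)  # 3. justification criteria
    severity: str = "minor"        # 4. "minor", "moderate" or "severe"
    confidence: int = 1            # 5. 1 (lowest) to 5 (highest)
```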
2.2 Individual Problem Consolidation
With the four lists of extracted UPs, each participant was required to filter out any duplicates within the lists and then merge similar UPs, resulting in two sets of UPs (P1-T1 and P2-T1 as one set; P1-T2 and P2-T2 as the other). Unique UPs were retained or discarded during this process. The participants were asked to record the outcomes in the same form used for problem extraction, but they needed to indicate explicitly in the UP-identifier column which UPs had been combined. Severity and confidence levels could also be adjusted. No time limit was imposed.

2.3 Collaborative Problem Consolidation
After a break of several days, the two participants of a group came together to merge their respective UP lists from the individual sessions into a master list. They could access all the materials used in the earlier sessions. They were asked to track every item (i.e., a single UP or a set of combined UPs) in their consolidated list by recording in a structured form which of three possible changes was made: merged (and with which item), retained or discarded (see the sketch below). No time limit was imposed on any of the above procedures. While individual and collaborative problem consolidation involved basically similar sub-tasks, the latter was conducted to observe how the collaborative setting influenced an individual's merging strategies.
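The required bookkeeping — logging, for every item, whether it was merged (and with what), retained or discarded, along with any rating adjustments — could be captured in a record like the following (our naming, not the paper's form):

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical record of one consolidation decision, mirroring the three
# outcomes participants had to log for every item in sections 2.2 and 2.3.
@dataclass
class ConsolidationDecision:
    up_id: str                             # identifier of the UP (or UP aggregate)
    outcome: str                           # "merged", "retained" or "discarded"
    merged_with: Optional[str] = None      # merge partner's identifier, if merged
    new_severity: Optional[str] = None     # adjusted severity, if changed
    new_confidence: Optional[int] = None   # adjusted confidence, if changed
```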
[Figure 1 (diagram). Workflow: the observational reports (P1-T1, P1-T2, P2-T1, P2-T2) undergo Problem Extraction by each evaluator (E1, E2); Individual Problem Filtering and Merging produces, per evaluator, a merged list of UPs for T1 and one for T2; Collaborative Problem Filtering and Merging then produces the consolidated lists of UPs for T1 and T2.]
Figure 1: The workflow of the problem consolidation process
3. RESULTS

3.1 Individual Problem Consolidation
The ten participants extracted from the observational reports altogether 98 and 81 UPs for T1 and T2 over the two users (P1 and P2), respectively. Furthermore, they individually consolidated their UPs. Table 1 shows the extent to which the participants merged, discarded and retained the extracted UPs.

Table 1. Distribution of outcomes in the individual filtering
     Merged   Discarded   Retained
T1   39%      13%         48%
T2   51%      10%         39%
For the merged and retained UPs, there were changes in severity ratings and/or confidence levels, or no changes at all. To simplify the results, we collapse the different degrees of increase or decrease (e.g. minor → moderate/severe, or vice versa) into INC or DEC, respectively, and denote no change with SAME.

Table 2. Severity/confidence changes in merged UPs (Indiv.)
        Severity               Confidence
        T1         T2          T1         T2
DEC     4 (10%)    3 (7%)      6 (15%)    4 (10%)
SAME    20 (53%)   29 (71%)    15 (40%)   18 (44%)
INC     14 (37%)   9 (22%)     17 (45%)   19 (46%)
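Collapsing rating changes into INC/DEC/SAME amounts to a simple sign comparison; a minimal sketch, with an ordinal coding of the severity labels that is our own assumption:

```python
# Ordinal coding of the severity labels (our assumption; the paper uses labels).
SEVERITY_RANK = {"minor": 1, "moderate": 2, "severe": 3}

def change_category(before, after, rank=None):
    """Collapse any degree of rating change into INC, DEC or SAME."""
    if rank is not None:
        before, after = rank[before], rank[after]
    if after > before:
        return "INC"
    if after < before:
        return "DEC"
    return "SAME"

# change_category("minor", "severe", SEVERITY_RANK) -> "INC"
# change_category(4, 3) -> "DEC"   (confidence levels are already numeric, 1-5)
```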
The same notation applies to the confidence levels. In merging the UPs, the participants tended to increase the severity ratings by one or two degrees (37% for T1 and 22% for T2; Table 2). In contrast, they seemingly did not bother to adjust the severity of the retained UPs (2% and 6% for T1 and T2, respectively). In the post-filtering interviews, most participants explained that when a UP was identified in both P1 and P2, this could indicate that the UP was more severe than originally estimated and that it corroborated the realness of the problem, thereby boosting their confidence. Interestingly, the correlation between the original severity ratings and confidence levels (r = 0.25, n = 179, p = 0.001) was significant, implying that the participants were more confident in their judgements of severe UPs and less so when judging minor or moderate UPs. In contrast, the correlation between the changes in the two variables (r = 0.19, n = 26) was insignificant. In other words, changing the severity of a UP does not imply that the participant became more (or less) confident about the realness of that UP.
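The reported coefficients are ordinary Pearson correlations over paired ratings; with SciPy the computation looks as follows (the data below are illustrative, not the study's n = 179 pairs):

```python
from scipy.stats import pearsonr

# Illustrative paired ratings: severity coded 1-3, confidence 1-5.
severity = [1, 2, 3, 3, 2, 1, 3, 2]
confidence = [2, 3, 5, 4, 3, 2, 5, 3]

r, p = pearsonr(severity, confidence)
print(f"r = {r:.2f}, p = {p:.3f}")  # the paper reports r = 0.25, p = 0.001 for its data
```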
3.2 Collaborative Problem Consolidation
In comparison, the participants demonstrated an even stronger tendency to merge UPs in the collaborative setting (Table 3): the merging rate was higher than that observed in the individual sessions (39% vs. 81% for T1; 51% vs. 77% for T2). The participants tended to negotiate at a higher level of abstraction, where broad problem types can accommodate a variety of problem instances, thus mitigating direct confrontation with partners over controversial similarities. The participants tended to be receptive to their partners' proposals, especially when the agreement thus reached would not cause any actual economic or personal gain (or loss). When negotiating whether to merge or retain UPs, the participants adjusted the severity and confidence ratings. For each aggregate, we averaged the ratings of the original set of to-be-merged UPs and compared this with the corresponding final ratings (see the sketch after Table 4). Table 4 displays the results for the merged UPs; patterns similar to Table 2 were observed.

Table 3. Distribution of outcomes in the collaborative filtering
     Merged   Discarded   Retained
T1   81%      10%         9%
T2   77%      15%         8%

Table 4. Severity/confidence changes in merged UPs (collab.)
        Severity               Confidence
        T1         T2          T1         T2
DEC     2 (5%)     2 (7%)      2 (5%)     3 (11%)
SAME    23 (52%)   16 (57%)    22 (50%)   13 (46%)
INC     22 (43%)   10 (36%)    19 (45%)   12 (43%)
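The per-aggregate averaging can be sketched as below; the ordinal coding of severity and the dictionary representation are our assumptions, as the paper does not specify its computation:

```python
from statistics import mean

# Ordinal coding of severity labels (our assumption).
SEVERITY_RANK = {"minor": 1, "moderate": 2, "severe": 3}

def aggregate_baseline(to_be_merged):
    """Average the original ratings of a set of to-be-merged UPs; the result is
    the baseline against which the negotiated final ratings are compared."""
    sev = mean(SEVERITY_RANK[up["severity"]] for up in to_be_merged)
    conf = mean(up["confidence"] for up in to_be_merged)
    return sev, conf

group = [{"severity": "minor", "confidence": 3},
         {"severity": "moderate", "confidence": 4}]
print(aggregate_baseline(group))  # (1.5, 3.5); a final "moderate" (2) would count as INC
```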
4. DISCUSSION
The empirical findings of this study enable us to draw comparisons between the individual and collaborative UP consolidation processes, both of which presumably involve the core mechanism of judging similarity among UPs. One notable distinction is the lenience towards merging in the collaborative setting, as shown by the high merging rate. Indeed, quite a number of participants combined UPs with their partners' that they had not merged in their individual sessions. This may be attributed to social pressure coercing them to reach consensus. The data indicate that, as a result of the merging process, severity ratings of UPs tend to inflate and the number of UPs tends to deflate excessively in the collaborative setting. In contrast, confidence levels, in which personal experience plays a role, do not fluctuate with the merging process. Previous research indicates that severity ratings influence how developers and project managers prioritize which UPs to fix ([3], [6]). Invalid severity ratings presumably lead to the fixing of less urgent UPs; consequently, the quality of the system may still be undermined by the more severe and more urgent UPs that remain unfixed.

The implication for future work is to look into relevant theories of similarity (an age-old issue), communication and social interaction. Further, we aim to extend our empirical studies by systematically comparing merging through negotiation (i.e. the consolidation procedure is carried out by a group of two or three usability specialists, a group of developers, or an integrated team) versus merging through authority (i.e. a single person-in-charge combines the different lists of UPs). The quality of the consolidated usability outcomes will be compared, enabling us to identify valid and reliable methods for consolidating UPs and to develop objective measures of the cost-effectiveness of such methods. The findings thus obtained will also contribute to our ongoing research endeavour on downstream utility.

5. REFERENCES
[1] Connell, I., & Hammond, N. (1999). Comparing usability evaluation principles with heuristics: Problem instances vs. problem types. Proc. INTERACT 1999.
[2] Hassenzahl, M. (2000). Prioritizing usability problems: Data-driven and judgement-driven severity estimates. Behaviour & Information Technology, 19(1), 29-42.
[3] Hertzum, M. (2006). Problem prioritization in usability evaluation: From severity assessments toward impact on design. International Journal of Human-Computer Interaction (IJHCI), 21(2), 125-146.
[4] Hertzum, M., & Jacobsen, N.E. (2003). The evaluator effect: A chilling fact about usability evaluation methods. IJHCI, 15(1).
[5] Howarth, J. (2007). Supporting novice usability practitioners with usability engineering tools. PhD thesis, Virginia Tech.
[6] Law, E. L.-C. (2006). Evaluating the downstream utility of user tests and examining the developer effect: A case study. IJHCI, 21(2), 147-172.
[7] Law, E. L.-C., & Hvannberg, E. T. (2004). Analysis of combinatorial user effect in international usability tests. Proc. CHI 2004.
[8] Lewis, J.R. (1994). Sample sizes for usability studies: Additional considerations. Human Factors, 36(2), 368-378.
[9] Nielsen, J. (1994). Heuristic evaluation. In J. Nielsen & R.L. Mack (Eds.), Usability Inspection Methods. New York: Wiley.
[10] Virzi, R.A. (1992). Refining the test phase of usability evaluation: How many subjects is enough? Human Factors, 34(4), 457-468.