<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>September</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Problems of Consolidating Usability Problems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Effie Lai-Chong Law</string-name>
          <email>law@tik.ee.ethz.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Iceland 107</institution>
          <addr-line>Reykjavik</addr-line>
          <country country="IS">Iceland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Leicester/ ETH Zürich LE1 7RH Leicester/ Institut TIK UK/</institution>
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2008</year>
      </pub-date>
      <volume>24</volume>
      <issue>2008</issue>
      <abstract>
        <p>The process of consolidating usability problems (UPs) is an integral part of usability evaluation involving multiple users/analysts. However, little is known about the mechanism of this process and its effects on evaluation outcomes, which presumably influence how developers redesign the system of interest. We conducted an exploratory research study with ten novice evaluators to examine how they performed when merging UPs in individual and collaborative settings and how they reached consensus. Our findings indicate that collaborative merging causes the absolute number of UPs to deflate and, concomitantly, the frequency of certain UP types as well as their severity ratings to inflate excessively. This can be attributed to the susceptibility of novice evaluators to persuasion in a negotiation setting, which led them to aggregate UPs leniently. Such distorted UP attributes may mislead the prioritization of UPs for fixing and thus result in ineffective system redesign.</p>
      </abstract>
      <kwd-group>
        <kwd>Usability problems</kwd>
        <kwd>Merging</kwd>
        <kwd>Filtering</kwd>
        <kwd>Consensus building</kwd>
        <kwd>Downstream utility</kwd>
        <kwd>Severity</kwd>
        <kwd>Confidence</kwd>
        <kwd>Evaluator effect</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The extent to which UPs identified by different users/analysts
overlap seems unpredictable, despite persistent research
efforts to formalize the cumulative relation between the
number of users/analysts and the number of UPs ([
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]). The practical
implication of this unpredictability is to recruit as many users/analysts
as the project’s resources allow, thereby maximizing the
probability of identifying most, though not all, UPs.
      </p>
      <p>
        In the HCI literature, the UP consolidation procedure is mostly
described at a coarse-grained level. Nielsen [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], when addressing
the issue of multiple users/analysts, highlighted the significance
of merging different UP lists, but he did not specify how this
should be done. Connell and Hammond [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], in comparing the
effectiveness of different UEMs, delineated the merging
procedure at a rather abstract level. Further, Hertzum and
Jacobsen [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] coined the notion of the evaluator effect, which has drawn
much attention from the HCI community to the reliability
and validity of usability evaluation. Nonetheless, their
work focused on problem extraction on an individual basis rather
than problem merging on a collaborative basis. More recently, a
tool for merging and grouping UPs has been developed [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ],
which, however, supports the work of individual evaluators but
neglects the collaborative aspect of usability evaluation.
      </p>
      <p>
        In summary, the actual practice of UP consolidation is largely
open, unstructured and unchecked. We conducted a research study
with two major goals: to examine the impact of the UP consolidation
process on evaluation outcomes, and to understand the mechanism
underlying the consensus building process. In this paper we
summarize the main findings on the first issue; the second is left
out because the data are still being analyzed.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. RESEARCH METHODS</title>
      <p>The empirical study was conducted at a university in the UK. Ten
computer science students (one female) were
recruited. All had acquired reasonable knowledge of HCI and
experience in user-based evaluation through lectures and projects.
They were grouped into five pairs. The source data came from
think-aloud usability tests of an e-learning platform, conducted with
representative end-users one year earlier. Among the different types
of data collected in those tests, we
used for the current study the observational reports written
by the experimenter, who was present throughout the testing
sessions and recorded the users’ behaviours in very fine detail.
We also developed several structured forms to record the
participants’ findings in the different steps of our study. All
participants attended two sessions. In the first
they performed Individual Problem Extraction and Individual
Problem Consolidation; about a week later, they paired up to
perform Collaborative Problem Consolidation.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Individual Problem Extraction</title>
      <p>Each participant was given the narrative observational reports
(printed texts) describing how the users P1 and P2 performed Task 1 (T1)
“Browse the Catalogue” and Task 2 (T2) “Provide and Offer a
Learning Resource”. For each UP extracted, the participant was
required to record five attributes in a structured analysis form:
        <list list-type="order">
          <list-item><p>Develop a UP identifier in a given format;</p></list-item>
          <list-item><p>Provide a UP description that is as detailed as possible;</p></list-item>
          <list-item><p>Select criteria from a given list to justify the UP;</p></list-item>
          <list-item><p>Judge the severity level of the UP: minor, moderate or severe;</p></list-item>
          <list-item><p>Rate how confident the evaluator was that the UP identified was
true: 1 (lowest) to 5 (highest).</p></list-item>
        </list>
After completing the analysis form for P1’s T1, the participant was
asked to apply the same procedure to P1’s T2, and then to P2’s T1
and T2 (Figure 1). In other words, each participant was required
to analyse four sets of data (P1-T1, P1-T2, P2-T1 and P2-T2).</p>
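      <p>As an illustration only, the five attributes recorded for each UP could be represented as a small record type. The sketch below is not the form used in the study: the class and field names, the numeric coding of severity (1 to 3) and the identifier format in the example are all assumptions introduced here.</p>
      <preformat>
```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class Severity(IntEnum):
    # Numeric coding is an assumption; the form only names the three levels.
    MINOR = 1
    MODERATE = 2
    SEVERE = 3


@dataclass
class UsabilityProblem:
    identifier: str        # attribute 1 (format below is hypothetical)
    description: str       # attribute 2
    criteria: List[str]    # attribute 3: criteria chosen from a given list
    severity: Severity     # attribute 4
    confidence: int        # attribute 5: 1 (lowest) to 5 (highest)

    def __post_init__(self):
        # Reject confidence values outside the 1-5 scale.
        if self.confidence not in range(1, 6):
            raise ValueError("confidence must be between 1 and 5")


# Hypothetical example record for user P1, task T1.
example = UsabilityProblem(
    identifier="E1-P1-T1-01",
    description="User could not find the catalogue search field.",
    criteria=["user expressed frustration"],
    severity=Severity.MODERATE,
    confidence=4,
)
```
      </preformat>
      <p>Coding severity as an ordered scale makes it straightforward to compare ratings before and after consolidation.</p>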
    </sec>
    <sec id="sec-4">
      <title>2.2 Individual Problem Consolidation</title>
      <p>With the four lists of extracted UPs, the participant was required
to filter out any duplicates within the lists and then merge similar
UPs, resulting in two sets of UPs (i.e. P1-T1 and P2-T1 as one set;
P1-T2 and P2-T2 as another set). Unique UPs could be
retained or discarded during this process. The participants were
asked to record the outcomes in the same form used for problem
extraction, but they needed to indicate explicitly in the column
UP-identifier which UPs were combined. Severity and confidence
levels could also be adjusted. No time limit was imposed.</p>
    </sec>
    <sec id="sec-5">
      <title>2.3 Collaborative Problem Consolidation</title>
      <p>After a break of several days, the two participants of each pair came
together to merge their respective lists of UPs, prepared in the
individual sessions, into a master list. They could access all the
materials used in the earlier sessions. They were asked to track
every item (i.e. a single UP or combined UPs) in their own
consolidated list by recording in a structured form which of the
three possible changes was made: merged (and with which item),
retained or discarded. No time limit was imposed on any of the
above procedures. While individual and collaborative problem
consolidation basically involved similar sub-tasks, the latter was
conducted to observe how the collaborative setting influenced an
individual’s merging strategies.</p>
      <p>[Figure 1. Recoverable labels: Observational Reports P1-T1, P1-T2; Observational Reports P2-T1, P2-T2; E1.]</p>
    </sec>
    <sec id="sec-6">
      <title>3. RESULTS</title>
    </sec>
    <sec id="sec-7">
      <title>3.1 Individual Problem Consolidation</title>
      <p>The ten participants extracted from the observational reports
a total of 98 UPs for T1 and 81 UPs for T2 over the two users (P1
and P2). They then individually consolidated
their UPs. Table 1 shows the extent to which the participants
merged, discarded and retained the UPs extracted.</p>
      <p>For the merged and retained UPs, severity
ratings and/or confidence levels either changed or stayed the same. To simplify
the results, we collapse different degrees of increase/decrease
(e.g. minor → moderate/severe or vice versa) into INC or DEC,
respectively, and denote no change with SAME.
The same notation is applied to the confidence level.</p>
      <p>In merging the UPs, the participants tended to increase the severity
ratings by one or two degrees (i.e. 37% for T1 and 22% for T2;
Table 2). In contrast, they rarely adjusted the
severity of the UPs retained (i.e. 2% and 6% for T1 and T2,
respectively). In the post-filtering interviews, most participants
explained that when a UP was identified for both P1 and P2, it
could indicate that the UP was more severe than originally
estimated and that it confirmed the realness of the problem, thereby
boosting their confidence. Interestingly, the correlation between
the original severity ratings and confidence levels (r = 0.25, n =
179, p = 0.001) was found to be significant, implying that the
participants were more confident that they had judged the severe UPs
correctly but less so when judging minor or moderate UPs. In
contrast, the correlation between the changes in both variables (r
= 0.19, n = 26) was not significant. In other words, changing the
severity of a UP does not imply that the participant has become
more (or less) confident about the realness of the UP.</p>
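      <p>The two correlations reported above can be checked with the standard t-test for a Pearson coefficient, t = r·sqrt(n − 2) / sqrt(1 − r²) with n − 2 degrees of freedom. The sketch below assumes that this is the test behind the reported values, which the text does not state; the function and variable names are introduced here for illustration.</p>
      <preformat>
```python
import math


def t_statistic(r: float, n: int) -> float:
    """Return t = r * sqrt(n - 2) / sqrt(1 - r^2) for a Pearson r on n pairs."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


# Original severity ratings vs. confidence levels (r = 0.25, n = 179).
t_original = t_statistic(0.25, 179)

# Changes in severity vs. changes in confidence (r = 0.19, n = 26).
t_changes = t_statistic(0.19, 26)

print(f"original ratings: t = {t_original:.2f}, df = {179 - 2}")
print(f"rating changes:   t = {t_changes:.2f}, df = {26 - 2}")
```
      </preformat>
      <p>Under this assumption the first coefficient gives t of roughly 3.4, well above the two-tailed 5% critical value of about 1.97 for 177 degrees of freedom, whereas the second gives t of roughly 0.95, below the critical value of about 2.06 for 24 degrees of freedom. This agrees with the pattern reported above.</p>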
    </sec>
    <sec id="sec-8">
      <title>3.2 Collaborative Problem Consolidation</title>
      <p>In comparison, the participants demonstrated an even stronger
tendency to merge UPs in the collaborative setting (Table 3): the
merging rate rose from 39% in the individual sessions to 81% in the
collaborative sessions for T1, and from 51% to 77% for T2.
The participants tended to
negotiate at a higher level of abstraction, where broad problem types can
accommodate a variety of problem instances, thus avoiding
direct confrontation with partners over controversial similarities.
The participants tended to be receptive to their partners’ proposals,
especially when the agreement thus reached would not cause any
actual economic or personal gain (or loss). When negotiating to
merge or retain UPs, the participants adjusted the severity and
confidence ratings. For each aggregate we averaged the ratings of
the original set of to-be-merged UPs and compared the average with the
corresponding final rating. Table 4 displays the results for the
merged UPs. Similar patterns to Table 1 were observed.</p>
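      <p>The comparison between averaged original ratings and the final rating of an aggregate can be sketched as follows. This is a minimal illustration, assuming ratings are coded numerically (e.g. severity as 1 = minor, 2 = moderate, 3 = severe); the function name and the example values are hypothetical and are not data from the study.</p>
      <preformat>
```python
from statistics import mean
from typing import Sequence


def rating_change(original_ratings: Sequence[float], final_rating: float) -> str:
    """Classify the final rating of an aggregate against the mean of the
    ratings of the UPs that were merged into it."""
    baseline = mean(original_ratings)
    if final_rating > baseline:
        return "INC"
    if baseline > final_rating:
        return "DEC"
    return "SAME"


# Hypothetical aggregates: (original ratings of merged UPs, final rating).
aggregates = [([1, 2], 3), ([3, 3], 3), ([2, 3], 2)]
labels = [rating_change(originals, final) for originals, final in aggregates]
print(labels)
```
      </preformat>
      <p>Because the baseline is a mean, an aggregate built from UPs with mixed ratings counts as INC whenever the final rating exceeds that mean, even if it only matches the highest original rating.</p>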
    </sec>
    <sec id="sec-9">
      <title>4. DISCUSSION</title>
      <p>
        The empirical findings of this study enable us to draw
comparisons between the individual and collaborative UP
consolidation processes, which presumably involve the core
mechanism of judging similarity among UPs. One notable
distinction is the lenience towards merging in the collaborative
setting, as shown by the high merging rate. Indeed, quite a
number of participants combined UPs that they had not merged in
their individual sessions so that these could be merged with their partners’ UPs. This may be
attributed to social pressure that coerces them to reach consensus.
The data indicate that as a result of the merging process, severity
ratings of UPs tend to inflate and the number of UPs tends to
deflate excessively in the collaborative setting. In contrast,
confidence levels, in which personal experience plays a role, do
not fluctuate with the merging process. Previous research studies
indicate that severity ratings influence how developers and project
managers prioritize which UPs to fix ([
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]). Invalid severity
ratings presumably lead to the fixing of less urgent UPs.
Consequently, the quality of the system may still be undermined
by more severe and more urgent UPs that remain unfixed.
      </p>
      <p>The implication for future work is to look into relevant
theories on similarity (an age-old issue), communication, and
social interaction. Further, we aim to extend our empirical studies
by systematically comparing merging through negotiation (i.e. the
consolidation procedure is implemented by a group of two
or three usability specialists, a group of developers or an
integrated team) with merging through authority (i.e. a single
person in charge combines the different lists of UPs). The quality
of the consolidated usability outcomes will be compared, thereby
enabling us to identify valid and reliable methods for
consolidating UPs and to develop objective measures of the
cost-effectiveness of such methods. Findings thus obtained will also
contribute to our ongoing research endeavour on downstream
utility.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Connell</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Hammond</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          (
          <year>1999</year>
          ).
          <article-title>Comparing usability evaluation principles with heuristics: Problem instances vs. problem types</article-title>
          .
          <source>Proc. INTERACT</source>
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Hassenzahl</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2000</year>
          ).
          <article-title>Prioritizing usability problems: data-driven and judgement-driven severity estimates</article-title>
          .
          <source>Behaviour &amp; Information Technology</source>
          ,
          <volume>19</volume>
          (
          <issue>1</issue>
          ),
          <fpage>29</fpage>
          -
          <lpage>42</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Hertzum</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2006</year>
          ).
          <article-title>Problem prioritization in usability evaluation: From severity assessments toward impact on design</article-title>
          .
          <source>International Journal of Human Computer Interaction (IJHCI)</source>
          ,
          <volume>21</volume>
          (
          <issue>2</issue>
          ),
          <fpage>125</fpage>
          -
          <lpage>146</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Hertzum</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Jacobsen</surname>
            ,
            <given-names>N.E.</given-names>
          </string-name>
          (
          <year>2003</year>
          ).
          <article-title>The evaluator effect: A chilling fact about usability evaluation methods</article-title>
          .
          <source>International Journal of Human Computer Interaction (IJHCI)</source>
          ,
          <volume>15</volume>
          (
          <issue>1</issue>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Howarth</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2007</year>
          ).
          <article-title>Supporting novice usability practitioners with usability engineering tools</article-title>
          .
          <source>PhD thesis</source>
          (VT).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Law</surname>
            ,
            <given-names>E. L.-C.</given-names>
          </string-name>
          (
          <year>2006</year>
          ).
          <article-title>Evaluating the Downstream Utility of User Tests and Examining the Developer Effect: A Case Study</article-title>
          .
          <source>International Journal of Human Computer Interaction (IJHCI)</source>
          ,
          <volume>21</volume>
          (
          <issue>2</issue>
          ),
          <fpage>147</fpage>
          -
          <lpage>172</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Law</surname>
            ,
            <given-names>E. L-C.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Hvannberg</surname>
            ,
            <given-names>E. T.</given-names>
          </string-name>
          (
          <year>2004</year>
          ).
          <article-title>Analysis of combinatorial user effect in international usability test</article-title>
          .
          <source>Proc. CHI</source>
          <year>2004</year>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Lewis</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          (
          <year>1994</year>
          ).
          <article-title>Sample sizes for usability studies: Additional considerations</article-title>
          .
          <source>Human Factors</source>
          ,
          <volume>36</volume>
          (
          <issue>2</issue>
          ),
          <fpage>368</fpage>
          -
          <lpage>378</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Nielsen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>1994</year>
          ).
          <article-title>Heuristic evaluation</article-title>
          . In J. Nielsen &amp; R.L. Mack (Eds.),
          <source>Usability inspection methods</source>
          . New York: Wiley.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Virzi</surname>
            ,
            <given-names>R.A.</given-names>
          </string-name>
          (
          <year>1992</year>
          ).
          <article-title>Refining the test phase of usability evaluation: How many subjects is enough?</article-title>
          <source>Human Factors</source>
          ,
          <volume>34</volume>
          (
          <issue>4</issue>
          ),
          <fpage>457</fpage>
          -
          <lpage>468</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>