User-Centered Evaluation of an Adaptive User Interface in
            the Context of Warehouse Picking
               Jörg Rett                                          Yucheng Jin                                Sara Bongartz
               SAP AG                                               SAP AG                                      SAP AG
          Darmstadt, Germany                                   Darmstadt, Germany                         Darmstadt, Germany
          joerg.rett@sap.com                                  yucheng.jin@sap.com                        sara.bongartz@sap.com


ABSTRACT                                                                       of users, but also test and evaluate the prototypes with users
Although nowadays adaptive user interfaces (AUIs) can be                       in different design phases. Such iterative evaluations can be
found in many applications, many of their drawbacks and                        named user-centred evaluations (UCE) and are necessary
risks are not yet sufficiently researched. We                                  for the success of adaptive systems, by making the
take a user-centered design approach in developing an                          designers understand the users’ experience and learning
adaptive application and demonstrate that the user-                            process of adaptation rules. UCE aims to verify the quality
friendliness of an adaptive application benefits from an                       of a product, detect problems and support decisions [3] and
early and iterative evaluation of the adaptation rules.                        find and solve problems in time. As a result, the system can
Drawbacks of adaptive interfaces are discovered and solved                     be more easily adopted by users, with greater ease of use
in our evaluation and design process, and recommendations                      and a more pleasant user experience.
for the development of adaptive systems are given.
                                                                               The goals of different phases in the iterative design process
Author Keywords                                                                according to [3] are shown in Fig. 1. In the paper at hand,
Adaptive service front-ends, Context-aware user interfaces,                    the authors focus on the phases associated with “detecting
Warehouse picking system, User-centered evaluation                             problems”. These phases involve low-fidelity and high-
                                                                               fidelity prototypes. According to the concept of UCE,
ACM Classification Keywords                                                    application prototypes should be evaluated at each level to
H.5.2 Evaluation/methodology; Prototyping; User-centered                       assure a successful design process.
design; Artificial, augmented, and virtual realities; Audio
input/output

INTRODUCTION
Nowadays, intelligent systems and ubiquitous computing
technologies let people interact with computers in a
personalized and smart way. Along with these trends,
adaptive user interfaces (AUIs) intend to provide an
effective way of interaction between humans and
computers, e.g. by adapting to users’ profiles and the
context of use. AUIs have been applied in many areas like
medical treatment, education, transport etc. However, in
                                                                                Fig.1 Phases of the iterative design process (according to [3])
practice, there are still many shortcomings and open
questions regarding AUIs. Careful development and evaluation of                In this paper, we present a prototype of an adaptive
adaptive features is crucial for successful AUIs.                              warehouse order picking system consisting of an adaptive,
                                                                               context-sensitive UI which is based on an architecture for
By applying a user-centred design (UCD) methodology, the                       context-sensitive service front-ends (for details on the
needs, desires, and limitations of end users of a product are
                                                                               architecture see [1]) which we evaluated in different phases
given extensive attention at each stage of the design
                                                                               according to the principles of UCE. Based on the results of a
process. As a multi-stage problem solving process, UCD not                     first user study with a low-fidelity prototype, we extracted
only requires designers to analyse and design from the view                    usability problems specific to the adaptive features of the
                                                                               application and conducted a second user study with an
                                                                               improved high-fidelity prototype. Finally, we draw some
                                                                               conclusions regarding the design of AUIs and provide
                                                                               indications for future work.
ADAPTIVE PROTOTYPE                                                with arrow and red start point). Users can switch between
Warehouse picking is a part of a logistics process often          the four views by speaking the name of the respective tabs.
found in retail and manufacturing industries. The adaptive
application presented here is enhanced with context aware         A Head-Mounted Display (HMD) and a wearable computer
features which consider user-related aspects (tasks to            are used to access the application. The UIs are implemented
accomplish, personal preferences and knowledge, etc.),            in HTML5, JavaScript and AJAX. The navigation route in
technical aspects (available interaction resources,               the Map view is drawn using the canvas element of HTML5.
connectivity support, etc.) and environmental aspects (level      Speech recognition is realized using the Google speech
of noise, light, etc.).                                           recognition engine. The architecture of the application
                                                                  implementation is shown in Fig.3. The display is used for
The graphical user interface (GUI) consists of four views         the visual output, the earphone for the vocal output and the
(Order, Map, Task and Report); for the sake of brevity, only      microphone for the vocal input of the user.
the Order view and the Map view are discussed. The Order
view (shown in Fig.2) mainly contains information on the
previous (i.e. shelf 451), the current (i.e. shelf 436) and the
next (i.e. shelf 448) items to be picked. This sequence of
picks is represented in three rows starting with the previous
pick and having the current pick highlighted (i.e. inverted)
and magnified.



                                                                  Fig. 3. a) Architecture of the prototype. b) Picking from a shelf
                                                                                   using a Head-Mounted Display

                                                                  The basic interaction sequence (i.e. the basic interaction
                                                                  flow) with an example of an adaptation is shown in Fig. 4:
                                                                  the picker is presented with three screens and two vocal
                                                                  outputs (upper balloons) and needs to perform two vocal
Fig. 2. Design of the graphical user interface (GUI). a) Order    inputs (lower balloons). Assuming that the picker is
  view b) Map view c) User support in fragile mode d) User        experienced, i.e. has been working for a long time in the
                     support in noisy mode                        warehouse environment and thus should know the location
                                                                  of the shelves by heart, the Map view can be omitted. We
The columns reflect the types of information available for        assume that an indicator of the experience level is stored
the pick (status, shelf, compartment, amount and container)       within the profile of the picker and is added as context
while only the status of the pick (e.g. open), the shelf          information at run-time during the log-in procedure.
identifier (e.g. 473) and the amount of items to be picked
(e.g. 7) are relevant here. The active view is reflected as a     Table 1 lists the five variations of the context and their
highlighted tab in the bottom area. The main information in       consequences for the interaction modalities with respect to
the Map view is a simplified representation of the location       the basic interaction flow. The adaptation server sends the
of the shelves (in bird’s-eye view) showing the current           updated data to the wearable computer after a change in the
location of the picker (i.e. the previous shelf), the             context has triggered the execution of an adaptation rule.
destination shelf (i.e. 473) and a suggested route (green line    Some changes might be triggered by the smart environment
(e.g. tracking of the picker’s position or the item’s                usefulness, comprehensibility or simplicity, which were
location).                                                           assessed by a questionnaire.
                                                                     To address such issues, the five adaptation rules were the
                                                                     independent variables. We had a within-subject design,
                                                                     meaning that every participant was confronted with every
                                                                     adaptation rule. The dependent variable was the
                                                                     subjectively perceived quality of the adaptation rule as
                                                                     assessed in a 9-item questionnaire. The questions originated
                                                                     from a list of non-functional requirements for the prototype
                                                                     identified in user studies in the beginning of the project and
                                                                     aimed at assessing the following aspects: the user’s
                                                                     awareness of the adaptation rule, its appropriateness and
Fig. 4. Basic interaction flow with adaptation: the execution of     comprehensibility, its effectiveness with respect to
the rule for an experienced picker omits the appearance of the       performance and usability, its error-prevention, continuity,
                     Map view (dotted line)                          intuitiveness, and general likeability.
USER CENTRED EVALUATION                                              Participants were company staff or students of the local
Following the principles of UCE in the design process of             university. A total of 10 participants took part in the study,
our AUI, we conducted two evaluations, one with a low-               9 were male and 1 was female. The average age of
fidelity prototype and another one with a high-fidelity one.         participants was 24 years (SD = 1.82). The technical set-up
Addressing usability problems found in the first study, the          consisted of an HMD with earphone worn by the
second study was aimed at evaluating the effect of                   participants. The device presented the GUI and the vocal
subsequent improvements on the prototype.                            output as shown in section 7. The sequence of the
                                                                     interaction was controlled by the moderator simulating the
In order to make both studies statistically and conceptually         change of context and the execution of the adaptation rule.
comparable, we used the same questionnaires and study
design in both studies. We present and compare the results           Participants were first introduced to the scenario and the
of the two user studies and draw conclusions regarding the           interface, i.e. getting familiar with the hypothetical situation
design of AUIs.                                                      in the warehouse and learning how to interact with the
                                                                     interface. Participants were asked to play through a “basic
Context variation            Interaction consequence                 interaction flow” which started with the system’s request to
The items to be picked are   After vocally confirming the arrival    pick items from a certain shelf, required the user to
fragile                      at the destination by the picker, the   hypothetically walk to that shelf and ended with the user’s
                             visual output will be switched off,     confirmation that he picked a certain amount of items.
                             only vocal remains.                     Participants were asked to comment on their hypothetical
The route is blocked by      The Map view marks the blocked          actions, e.g. by saying “I walk to the shelf 473 now” or “I
other pickers                path and suggests an alternative        pick 7 items from the shelf”. After ensuring that the
                             route.                                  participants understood the basic interaction flow of the
                                                                     interface, the study started by introducing the first
The picker is experienced    The Map view is omitted.
                                                                     alternative flow. All alternative flows (flows containing
The environment is noisy     The vocal input and output is           adaptation rules) were applied to the same scenario as
                             switched off, only visual output        practiced in the basic flow. Prior to playing through the
                             remains                                 alternative flows, participants were informed about the
The picking is not           An image of the item to be picked is    condition of the adaptation rule (e.g. “imagine you are now
performed due to some        shown, the vocal output is repeated.    in a noisy environment”), but not about the actual rule (i.e.
confusion or distraction                                             the action of the rule). All five rules were played through
                                                                     and the sequence of the adaptation rules was permuted to
Table 1. Variations of the context and its consequences for the
                    interaction modalities
                                                                     avoid order effects. After each rule, the 9-item
                                                                     questionnaire was filled out.
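
The context variations and interaction consequences of Table 1 can be read as a small rule table. The following Python sketch is only an illustration under our own assumptions: the names `Modalities`, `ADAPTATION_RULES` and `adapt`, the context keys, and the way several rules are combined are hypothetical and do not reflect the actual implementation of the adaptation server described in [1]. Timing aspects (e.g. that the Fragile rule only applies after the arrival has been confirmed) are left out.

```python
from dataclasses import dataclass, replace
from typing import Iterable


@dataclass(frozen=True)
class Modalities:
    """State of the basic interaction flow (hypothetical model)."""
    visual_output: bool = True
    vocal_output: bool = True
    vocal_input: bool = True
    map_view: bool = True
    alternative_route: bool = False
    item_image: bool = False
    repeat_vocal_output: bool = False


# One entry per row of Table 1: context variation -> change to the basic flow.
# The keys are assumed names, not identifiers from the prototype.
ADAPTATION_RULES = {
    "fragile_items": {"visual_output": False},
    "route_blocked": {"alternative_route": True},
    "experienced_picker": {"map_view": False},
    "noisy_environment": {"vocal_input": False, "vocal_output": False},
    "pick_timeout": {"item_image": True, "repeat_vocal_output": True},
}


def adapt(context: Iterable[str], base: Modalities = Modalities()) -> Modalities:
    """Apply the rule of every active context variation to the basic flow."""
    state = base
    for key in context:
        # Unknown context keys leave the state unchanged.
        state = replace(state, **ADAPTATION_RULES.get(key, {}))
    return state
```

For example, `adapt({"experienced_picker"})` yields a state in which only `map_view` differs from the basic flow, which corresponds to the dotted line in Fig. 4.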
User Study 1                                                         Since most of the scales of the questionnaire were not
We conducted a first user study in order to evaluate                 normally distributed, we applied non-parametric tests for the
the five adaptation rules from the end users’ point of view          data analysis. We calculated the Friedman test for every
(see [1]). The study aimed at evaluating the applicability           single questionnaire scale and the aggregated overall rating
and usefulness of the adaptation rules by assessing the              from all 9 scales (Bonferroni-corrected) to assess
quality of the adaptation rules as subjectively perceived by         differences between the five adaptation rules. In case of
the participants. The general concept “quality” was                  significance, we calculated a post-hoc Wilcoxon signed-
operationalized by several more specific constructs, e.g.
rank test for each pair of adaptation rules (Bonferroni-        coherent preference pattern. Traffic Jam and Pick Timeout
corrected as well).                                             are consistently and undoubtedly preferred by the users
                                                                (with very good overall ratings of 6.6 and 6.4 on a scale
The Friedman test revealed significant differences for the
                                                                from 0-7). Alongside the good rating of these two rules, the
aggregated overall rating over all 9 scales (χ²(4) = 18.74, p
                                                                standard deviation is very small, indicating a very high
= .001) and for 5 of the subscales: Appropriateness (χ²(4) =
                                                                agreement between the participants. However, the Fragile
19.26, p = .001), Performance (Z = -2.69, p=.007), Error-
                                                                Object rule, as the worst rated one, shows the highest
Prevention (χ²(4) = 22.73, p < .001), Intuitiveness (χ²(4) =
                                                                variance in the ratings between the subjects. This indicates
22.31, p < .001) and General Likeability (χ²(4) = 18.92, p =
                                                                that there is no strong agreement between the subjects, yet
.001). Only these significantly different scales are regarded
                                                                still most of the subjects gave comparably low ratings for
in detail here. Post-hoc tests revealed a significant
                                                                that rule. A possible explanation for this finding can be
difference in the rating between the rules Fragile Objects
                                                                drawn from the subjects’ comments. While all subjects
and Traffic Jam (Z = -2.60, p = .009) and Experienced
                                                                gave a positive opinion about the idea to support the
Worker and Traffic Jam (Z = -2.70, p=.007). The
                                                                process of picking a fragile object, most of the subjects
significant differences in the subscale Appropriateness are
                                                                noted that the actual realisation of that rule was poor.
between the rules Fragile Objects and Traffic Jam (Z = -
                                                                Turning off the display was irritating and non-intuitive to
2.62, p = .009) and Fragile Objects and Pick Timeout (Z = -
                                                                the subjects. The abrupt darkness in the HMD was
2.69, p = .007). For the subscale Error prevention, the
                                                                perceived as a break-down of the system and therefore
significant differences can be found between the rules
                                                                caused confusion. Subjects would have preferred to receive a
Fragile Object and Pick Timeout (Z = -2.71, p = .007),
                                                                short warning message before the display was turned off.
Traffic Jam and Experienced Worker (Z = -2.81, p = .005)
and Pick Timeout and Experienced Worker (Z = -2.68, p =         We found similarities among the rules that were ranked
.007). Intuitiveness shows significantly different values for   well and among those that were ranked poorly. The group of
the rules Fragile Objects and Traffic Jam (Z = -2.69, p =       poorly ranked rules omitted information like the visual output
.007). Finally, although the Friedman test revealed             and the Map view with regard to the Basic Interaction
significant differences between the rules for the scales        Flow. The Fragile rule takes a prominent position as a very
General Likeability and Performance, direct pairwise            strong modality, the visual channel, is shut off. Those rules
comparisons failed to reach significance after Bonferroni       that were ranked well, however, delivered additional
correction.                                                     information like the blocked path or the image of the item.
                                                                This noticeable difference between the adaptation rules is
                                                                presumably the reason for the striking difference in the
                                                                preference ratings. Therefore, in the second study, we
                                                                investigated the role of adding vs. removing information in
                                                                the course of interface adaptation. The second study tested
                                                                the hypothesis that the poorly ranked adaptation rules will
                                                                be ranked higher when information is not only removed but
                                                                the removal of information is actually explained beforehand
                                                                by adding information.
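
The within-subject analysis of study 1 (Friedman test across the five rules, followed by Bonferroni-corrected pairwise tests) can be illustrated with a short sketch. It is a simplified illustration, not the analysis script used for the study: the function names are hypothetical, the statistic is computed without the tie correction that statistical packages usually apply, and no p-value is derived.

```python
def average_ranks(values):
    """Rank values in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # ranks are 1-based
        for i in order[pos:end + 1]:
            ranks[i] = mean_rank
        pos = end + 1
    return ranks


def friedman_chi_square(ratings):
    """Friedman statistic (no tie correction).

    ratings[s][r] is the rating of participant s for rule r.
    """
    n = len(ratings)       # participants
    k = len(ratings[0])    # rules (conditions)
    rank_sums = [0.0] * k
    for row in ratings:
        for j, rank in enumerate(average_ranks(row)):
            rank_sums[j] += rank
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)


def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison significance level under Bonferroni correction."""
    return alpha / n_comparisons
```

With k rules the statistic is compared against a chi-square distribution with k - 1 degrees of freedom, which matches the four degrees of freedom reported for five rules. Five rules give ten possible rule pairs; since the number of comparisons entering the correction is not stated here, `n_comparisons` is left as a parameter. In practice, library routines such as SciPy's `scipy.stats.friedmanchisquare` and `scipy.stats.wilcoxon` would be used instead of hand-written code.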

                                                                User Study 2
                                                                The goal of user study 2 was to evaluate whether the
                                                                comparatively poor performance of the rules Fragile Object,
                                                                Experienced Worker and Noisy Environment was improved by
                                                                adding information (also called user support in [2])
                                                                prior to showing the adaptation in the UI. User support means
                                                                the prior explanation of an occurring adaptation or hints
                                                                of an approaching adaptation. The design of the study is
                                                                the same as in user study 1. However, in user study 1 we used a
                                                                paper-based map to simulate the warehouse layout and in
                                                                user study 2 we simulated the warehouse environment on
      Fig. 5. Study 1: Overall rating and the subscales         the floor of a large meeting room, with sheets of paper as
    Appropriateness, Error-Prevention and Intuitiveness         shelves and real items on the shelves representing the items
                                                                to be picked (see Fig. 6). Consequently, users were truly
The big picture of the results (see Fig. 5) shows a clear       able to move around and pick the items, which made the
trend: all quality aspects of the Fragile Object rule are       setting more realistic. The conditions for the adaptation
consistently rated the worst, and the Traffic Jam and Pick      rules were also implemented in a more realistic way, e.g. by
Timeout rule are consistently rated best. This pattern can be   putting obstacles in the way for the Traffic Jam rule or
observed for all quality scales, indicating a clear and
using real fragile objects (glasses) for the Fragile rule. An            The ratings of the four other rules (Experienced
accompanying research question was therefore whether the more            Worker, Traffic Jam, Pick Timeout and Noisy) did not
realistic setting affects the evaluation results. This means,            change in the course of the second experiment
since 2 out of four rules (Traffic Jam and Pick Timeout)
were not changed, the more realistic setting of the second
study would not affect the reliability of evaluation if the
evaluation scores of these two rules did not change.
Participants again were company staff or students of the
local university (who did not participate in the first study).
A total of 10 participants took part in the study, 9 were
male and 1 was female. The average age of participants was
29 years (SD = 4.44).
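
Because the two studies used different participants, ratings can only be compared between independent groups; the comparison reported for this study uses a Kruskal-Wallis test. The sketch below shows how the H statistic is obtained from pooled ranks. It is an illustration under our own assumptions: the function names are hypothetical, no tie correction is applied, and no p-value is computed.

```python
def pooled_ranks(values):
    """1-based ranks over the pooled sample; tied values get their mean rank."""
    ordered = sorted(values)
    reversed_order = ordered[::-1]
    ranks = []
    for v in values:
        first = ordered.index(v) + 1                      # first position of v
        last = len(ordered) - reversed_order.index(v)     # last position of v
        ranks.append((first + last) / 2)
    return ranks


def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H for independent groups (no tie correction)."""
    pooled = [v for group in groups for v in group]
    ranks = pooled_ranks(pooled)
    n_total = len(pooled)
    weighted_sum = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        weighted_sum += rank_sum ** 2 / len(group)
        start += len(group)
    return 12.0 / (n_total * (n_total + 1)) * weighted_sum - 3.0 * (n_total + 1)
```

Called with the ratings of one rule from study 1 and from study 2 as the two groups, the statistic has one degree of freedom. SciPy provides an equivalent routine with tie correction as `scipy.stats.kruskal`.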


                                                                     Fig. 7 Study 2: Overall rating and the subscales User
        Fig.6 Evaluation Environment of User Study 2                   Experience, Error-Prevention and Intuitiveness
Since most of the scales of the questionnaire were not           In order to test these observations for significance, we
normally distributed, we applied non-parametric tests for the    conducted a Kruskal-Wallis test comparing the results of
data analysis. We calculated the Friedman test for every         the first and the second study. The test reveals that the
single questionnaire scale and the aggregated overall rating     Overall Rating of the Fragile rule increased significantly
from all nine scales (Bonferroni-corrected) to assess            (H(1) = 12.17, p < .001), which can be attributed to the
differences between the five adaptation rules. In case of        scales Appropriateness (H(1) = 9.44, p = .002),
significance, we calculated a post-hoc Wilcoxon signed-          Performance (H(1) = 11.14, p = .001), Error Prevention
rank test for each pair of adaptation rules (Bonferroni-         (H(1) = 11.44, p = .001), User Experience (H(1) = 7.15, p
corrected as well).                                              = .008), Intuitiveness (H(1) = 12.75, p < .001) and General
                                                                 Likeability (H(1) = 8.07, p = .005). Thus, for the Fragile
The Friedman test revealed significant differences for the
                                                                 rule, all scales except Continuity and Comprehensibility
aggregated overall rating over all 9 scales (χ²(4) = 17.99, p    increased significantly. All other comparisons were not
= .001) and for three of the subscales: Error-Prevention (χ²     significant. Thus, all other rules were not rated better or
(4) = 17.76, p = .001), Intuitiveness (χ²(4) = 17.19, p = .002)  worse on any scale compared to study 1.
and User Experience (χ²(4) = 15.96, p = .003). The scales
with significant differences between the rules are displayed     DISCUSSION
in Fig. 7. Although the Friedman test revealed significant       User study 2 addressed the research question: does the
differences between the rules for all these scales, pairwise     addition of information prior to the removal of information
comparisons failed to reach significance after Bonferroni        in the course of an adaptation of the interface improve the
correction. Taking a look at the graphs, there are three main    perceived quality of the adaptation rule? The results of the
interesting observations:                                        study partly support this hypothesis. While the Fragile rule
         The Fragile rule improved significantly compared        improved significantly on almost all scales, the
                                                                 Experienced Worker and Noisy rules did not improve.
         to the first study
         The Experienced Worker rule performs                    The improvement of the Fragile rule can most probably be
                                                                 attributed to what Paymans et al. [2] call user support.
         consistently worse than the other rules (although
                                                                 According to the authors, users experience difficulty in
         pairwise comparison did not reach significance)         building adequate models of adaptive systems, therefore
                                                                 user support is expected to help users understand and learn
                                                                 the adaptive rules. For the Fragile rule, the performance
improved significantly with the help of user support. Before        fidelity of the ratings when evaluating adaptive features of
the HMD display was shut down, users were notified by a             an interface. Our studies suggest that the rating of adaptive
short alert video to be cautious when picking fragile objects,      rules has no direct and obvious relation to the fidelity of the
so that the rationale of the rule could be more easily              evaluation environment.
understood (prevent the user from visual distraction).
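
The idea of user support, i.e. announcing the removal of information before it happens, can be sketched as an ordering of UI steps. The function `plan_adaptation`, the step labels and the wording of the announcement are hypothetical illustrations; in the prototype the announcement for the Fragile rule was a short alert video rather than a text message.

```python
def plan_adaptation(rule_name, removes, adds, user_support=True):
    """Return the ordered UI steps for one adaptation (hypothetical sketch).

    removes: modalities or views the rule switches off
    adds:    information the rule adds to the interface
    """
    steps = []
    if removes and user_support:
        # Explain the upcoming removal before anything disappears.
        message = f"{rule_name}: " + ", ".join(removes) + " will be switched off"
        steps.append(("announce", message))
    steps += [("remove", item) for item in removes]
    steps += [("add", item) for item in adds]
    return steps
```

For the Fragile rule, `plan_adaptation("Fragile", ["visual output"], [])` places the announcement before the display is switched off, whereas a rule that only adds information, such as Pick Timeout, needs no announcement step.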
                                                                    CONCLUSIONS
However, for the Experienced Worker and Noisy rule, the             In the process of AUI development, adaptive rules must be
ratings are not improved by adding explanatory information          carefully designed and evaluated to avoid usability and user
as user support. We can think of two possible reasons for           experience pitfalls. Applying UCE in different phases of the
this finding. First, autonomous interface adaptations can
                                                                    development is helpful to detect the flaws of adaptive
easily reduce the usability of a system. Loss of control            features in time. On the basis of the results of two user
might be an issue in both rules. For example in the Noisy           studies, some common drawbacks of adaptive systems are
rule, users cannot confirm their location or the amount             detected and eliminated in our application system. The
by voice; instead the system will set a timeout for                 remedies or potential improvements of some of these
automatic confirmation. Setting the timeout either too long         drawbacks are proposed in our paper. As a main result, we
or too short will consequently put the user in an                   found that adding user support information can help
uncomfortable situation (i.e. waiting for or missing the            users to comprehend and accept adaptation rules.
following system information). In the Experienced Worker            Furthermore, we argue that enriching the context and users’
rule, the user might want to decide himself if he gets to see       profile can increase the precision of adaptation. Also,
the map or not, although he might not really need it. In both       enabling the user to intervene in the adaptation at any
cases, the loss of control over the system might be a
                                                                    time will improve user experience by improving the
problem. To overcome the problem of controllability, we             controllability. We are convinced that the iterative
can enrich the user profile and context information to              evaluation of adaptive systems is crucial to the successful
provide even more precise and personalized adaptations.             development of AUIs. Regarding the iterative testing of
Furthermore, we can also consider increasing the flexibility        such systems, our results suggest that the fidelity of the
of operation, so that users can more easily intervene in the        testing environment plays no major role with respect to
adaptation. Second, even in user study 2, the setting of the        the users’ rating of the adaptation rules. Thus, rapid
user study is still simulated. A real testing environment with      iterative testing of adaptation rules does not need to be an
real users (i.e. real pickers) might result in different ratings.   expensive enterprise and is therefore highly recommended.
Although the change in the fidelity between the two studies
presented here did not affect the ratings (see below), a real       REFERENCES
environment with real users might yield more valid                  1. Bongartz, S., Jin, Y., Paterno, F., Rett, J., Santoro, C.,
results (e.g. to imagine being an experienced user might not          Spano, L.D. Adaptive User Interfaces for Smart
result in the same rating as actually being an experienced            Environments with the Support of Model-based
user).                                                                Languages. To be published in Proc. AmI 2012.
 Furthermore, the change in the evaluative setting did not          2. Paymans, T.E., Lindenberg, J., Neerincx, M. Usability
affect the rating of the rules. This is an interesting finding        trade-offs for adaptive user interfaces: ease of use and
with regard to evaluation methodologies. Although the                 learnability. In Proc. IUI '04. ACM Press (2004), 301-
study design was much more realistic in the second study,             303.
the ratings of the unchanged rules Traffic Jam and Pick
                                                                    3. van Velsen, L., van der Geest, T., Klaassen, R.,
Timeout were exactly the same for both studies. Thus we
                                                                      Steehouder, M. User-centered evaluation of adaptive and
can conclude that a low-fidelity evaluation setting (e.g.
                                                                      adaptable systems: A literature review. Knowl. Eng. Rev.
imagining the movement through a warehouse vs. actually
                                                                      23, 3 (2008), 261-281.
moving through a simulated warehouse) does not affect the