User-Centered Evaluation of an Adaptive User Interface in the Context of Warehouse Picking

Jörg Rett, Yucheng Jin, Sara Bongartz
SAP AG, Darmstadt, Germany
joerg.rett@sap.com, yucheng.jin@sap.com, sara.bongartz@sap.com

ABSTRACT
Although adaptive user interfaces (AUIs) can nowadays be found in many applications, many of their downsides and risks have not yet been sufficiently researched. We take a user-centered design approach to the development of an adaptive application and demonstrate that the user-friendliness of such an application benefits from an early and iterative evaluation of the adaptation rules. Drawbacks of adaptive interfaces are discovered and resolved in our evaluation and design process, and recommendations for the development of adaptive systems are given.

Author Keywords
Adaptive service front-ends, Context-aware user interfaces, Warehouse picking system, User-centered evaluation

ACM Classification Keywords
H.5.2 Evaluation/methodology; Prototyping; User-centered design; Artificial, augmented, and virtual realities; Audio input/output

INTRODUCTION
Nowadays, intelligent systems and ubiquitous computing technologies let people interact with computers in a personalized and smart way. Along with these trends, adaptive user interfaces (AUIs) are intended to provide an effective way of interaction between humans and computers, e.g. by adapting to users’ profiles and the context of use. AUIs have been applied in many areas such as medical treatment, education and transport. In practice, however, AUIs still have many shortcomings and raise open questions. Careful development and evaluation of adaptive features is crucial for successful AUIs.

By applying a user-centred design (UCD) methodology, the needs, desires and limitations of the end users of a product are given extensive attention at each stage of the design process. As a multi-stage problem-solving process, UCD requires designers not only to analyse and design from the users’ point of view, but also to test and evaluate the prototypes with users in different design phases. Such iterative evaluations can be called user-centred evaluations (UCE). They are necessary for the success of adaptive systems because they help designers understand the users’ experience and how users learn the adaptation rules. UCE aims to verify the quality of a product, detect problems and support decisions [3], and to find and solve problems in time. As a result, the system can be adopted more easily by users, with greater ease of use and a more pleasant user experience.

The goals of the different phases in the iterative design process according to [3] are shown in Fig. 1. In this paper, we focus on the phases associated with “detecting problems”. These phases involve low-fidelity and high-fidelity prototypes. According to the concept of UCE, application prototypes should be evaluated at each level to ensure a successful design process.

Fig. 1. Phases of the iterative design process (according to [3])

In this paper, we present a prototype of an adaptive warehouse order picking system. It consists of an adaptive, context-sensitive UI based on an architecture for context-sensitive service front-ends (for details on the architecture see [1]), which we evaluated in different phases according to the principles of UCE. Based on the results of a first user study with a low-fidelity prototype, we extracted usability problems specific to the adaptive features of the application and conducted a second user study with an improved high-fidelity prototype. Finally, we draw conclusions regarding the design of AUIs and provide indications for future work.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page.
To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

ADAPTIVE PROTOTYPE
Warehouse picking is a part of a logistics process often found in retail and manufacturing industries. The adaptive application presented here is enhanced with context-aware features which consider user-related aspects (tasks to accomplish, personal preferences and knowledge, etc.), technical aspects (available interaction resources, connectivity support, etc.) and environmental aspects (level of noise, light, etc.).

The graphical user interface (GUI) consists of four views (Order, Map, Task and Report). For the sake of brevity, only the Order view and the Map view are discussed. The Order view (shown in Fig. 2) mainly contains information on the previous (i.e. shelf 451), the current (i.e. shelf 436) and the next (i.e. shelf 448) items to be picked. This sequence of picks is represented in three rows, starting with the previous pick and showing the current pick highlighted (i.e. inverted) and magnified.

Fig. 2. Design of the graphical user interface (GUI). a) Order view b) Map view c) User support in fragile mode d) User support in noisy mode

The columns reflect the types of information available for the pick (status, shelf, compartment, amount and container), while only the status of the pick (e.g. open), the shelf identifier (e.g. 473) and the amount of items to be picked (e.g. 7) are relevant here. The active view is reflected as a highlighted tab in the bottom area. The main information in the Map view is a simplified representation of the location of the shelves (in bird's-eye view) showing the current location of the picker (i.e. the previous shelf), the destination shelf (i.e. 473) and a suggested route (green line with arrow and red start point). Users can switch between the four views by speaking the name of the respective tab.

A Head-Mounted Display (HMD) and a wearable computer are used to access the application. The UIs are implemented in HTML5, JavaScript and AJAX. The navigation route in the Map view is drawn using the HTML5 canvas element. Speech recognition is realized using the Google speech recognition engine. The architecture of the application implementation is shown in Fig. 3. The display is used for the visual output, the earphone for the vocal output and the microphone for the vocal input of the user.

Fig. 3. a) Architecture of the prototype. b) Picking from a shelf using a Head-Mounted Display

The basic interaction sequence (i.e. the basic interaction flow) with an example of an adaptation is shown in Fig. 4: the picker is presented with three screens and two vocal outputs (upper balloons) and needs to perform two vocal inputs (lower balloons). For an experienced picker, i.e. one who has been working in the warehouse environment for a long time and thus should know the location of the shelves by heart, the Map view can be omitted. We assume that an indicator of the experience level is stored within the profile of the picker and is added as context information at run-time during the log-in procedure.

Fig. 4. Basic interaction flow with adaptation: the execution of the rule for an experienced picker omits the appearance of the Map view (dotted line)

Table 1 lists the five variations of the context and their consequences for the interaction modalities with respect to the basic interaction flow. The adaptation server sends the updated data to the wearable computer after a change in the context has triggered the execution of an adaptation rule. Some changes might be triggered by the smart environment (e.g. tracking of the picker’s position or the item’s location).

Context variation | Interaction consequence
The items to be picked are fragile | After the picker has vocally confirmed the arrival at the destination, the visual output is switched off; only the vocal output remains.
The route is blocked by other pickers | The Map view marks the blocked path and suggests an alternative route.
The picker is experienced | The Map view is omitted.
The environment is noisy | The vocal input and output are switched off; only the visual output remains.
The picking is not performed due to some confusion or distraction | An image of the item to be picked is shown; the vocal output is repeated.

Table 1. Variations of the context and its consequences for the interaction modalities

USER CENTRED EVALUATION
Following the principles of UCE in the design process of our AUI, we conducted two evaluations, one with a low-fidelity prototype and one with a high-fidelity prototype. Addressing usability problems found in the first study, the second study was aimed at evaluating the effect of the subsequent improvements to the prototype.

In order to make both studies statistically and conceptually comparable, we used the same questionnaires and study design in both studies. We present and compare the results of the two user studies and draw conclusions regarding the design of AUIs.

User Study 1
We conducted a first user study in order to evaluate the five adaptation rules from the end users' point of view (see [1]). The study aimed at evaluating the applicability and usefulness of the adaptation rules by assessing the quality of the adaptation rules as subjectively perceived by the participants. The general concept “quality” was operationalized by several more specific constructs, e.g. usefulness, comprehensibility or simplicity, which were assessed by a questionnaire. To address these aspects, the five adaptation rules served as the independent variable. We used a within-subject design, meaning that every participant was confronted with every adaptation rule. The dependent variable was the subjectively perceived quality of the adaptation rule as assessed in a 9-item questionnaire. The questions originated from a list of non-functional requirements for the prototype identified in user studies at the beginning of the project and aimed at assessing the following aspects: the user’s awareness of the adaptation rule, its appropriateness and comprehensibility, its effectiveness with respect to performance and usability, its error-prevention, continuity, intuitiveness, and general likeability.

Participants were company staff or students of the local university. A total of 10 participants took part in the study; 9 were male and 1 was female. The average age of the participants was 24 years (SD = 1.82). The technical set-up consisted of an HMD with earphone worn by the participants. The device presented the GUI and the vocal output as shown in section 7. The sequence of the interaction was controlled by the moderator, who simulated the change of context and the execution of the adaptation rule.

Participants were first introduced to the scenario and the interface, i.e. they became familiar with the hypothetical situation in the warehouse and learned how to interact with the interface. Participants were asked to play through a “basic interaction flow”, which started with the system's request to pick items from a certain shelf, required the user to hypothetically walk to that shelf and ended with the user’s confirmation that he had picked a certain amount of items. Participants were asked to comment on their hypothetical actions, e.g. by saying “I walk to the shelf 473 now” or “I pick 7 items from the shelf”. After ensuring that the participants understood the basic interaction flow of the interface, the study started by introducing the first alternative flow. All alternative flows (flows containing adaptation rules) were applied to the same scenario as practiced in the basic flow. Prior to playing through the alternative flows, participants were informed about the condition of the adaptation rule (e.g. “imagine you are now in a noisy environment”), but not about the actual rule (i.e. the action of the rule). All five rules were played through, and the sequence of the adaptation rules was permuted to avoid order effects. After each rule, the 9-item questionnaire was filled out.

Since most of the scales of the questionnaire were not normally distributed, we applied non-parametric tests for the data analysis. We calculated the Friedman test for every single questionnaire scale and for the aggregated overall rating from all 9 scales (Bonferroni-corrected) to assess differences between the five adaptation rules. In case of significance, we calculated a post-hoc Wilcoxon signed-rank test for each pair of adaptation rules (Bonferroni-corrected as well).

The Friedman test revealed significant differences for the aggregated overall rating over all 9 scales (χ²(4) = 18.74, p = .001) and for 5 of the subscales: Appropriateness (χ²(4) = 19.26, p = .001), Performance (Z = -2.69, p = .007), Error-Prevention (χ²(4) = 22.73, p < .001), Intuitiveness (χ²(4) = 22.31, p < .001) and General Likeability (χ²(4) = 18.92, p = .001). Only these significantly different scales are regarded in detail here. For the overall rating, post-hoc tests revealed a significant difference between the rules Fragile Object and Traffic Jam (Z = -2.60, p = .009) and between Experienced Worker and Traffic Jam (Z = -2.70, p = .007). The significant differences in the subscale Appropriateness are between the rules Fragile Object and Traffic Jam (Z = -2.62, p = .009) and between Fragile Object and Pick Timeout (Z = -2.69, p = .007). For the subscale Error-Prevention, significant differences can be found between the rules Fragile Object and Pick Timeout (Z = -2.71, p = .007), Traffic Jam and Experienced Worker (Z = -2.81, p = .005) and Pick Timeout and Experienced Worker (Z = -2.68, p = .007). Intuitiveness shows significantly different values for the rules Fragile Object and Traffic Jam (Z = -2.69, p = .007). Finally, although the Friedman test revealed significant differences between the rules for the scales General Likeability and Performance, the direct pairwise comparisons failed to reach significance due to the Bonferroni correction.

Fig. 5. Study 1: Overall rating and the subscales Appropriateness, Error-Prevention and Intuitiveness

The big picture of the results (see Fig. 5) shows a clear trend: all quality aspects of the Fragile Object rule are consistently rated the worst, and the Traffic Jam and Pick Timeout rules are consistently rated the best. This pattern can be observed for all quality scales, indicating a clear and coherent preference pattern. Traffic Jam and Pick Timeout are consistently preferred by the users (with very good overall ratings of 6.6 and 6.4 on a scale from 0-7). In addition to the good rating of these two rules, the standard deviation is very small, indicating very high agreement between the participants. However, the Fragile Object rule, as the worst rated one, shows the highest variance in the ratings between the subjects. This indicates that there is no strong agreement between the subjects, although most of them gave comparably low ratings for that rule. A possible explanation for this finding can be drawn from the subjects’ comments. While all subjects gave a positive opinion about the idea of supporting the process of picking a fragile object, most of them noted that the actual realisation of that rule was poor. Turning off the display was irritating and non-intuitive to the subjects. The abrupt darkness in the HMD was perceived as a break-down of the system and therefore caused confusion. Subjects would rather have received a short warning message before the display was turned off.

We found similarities among the rules that were ranked well and among those that were ranked poorly. The poorly ranked rules omit information, such as the visual output or the Map view, with regard to the basic interaction flow. The Fragile Object rule takes a prominent position because a very strong modality, the visual channel, is shut off. The rules that were ranked well, however, deliver additional information such as the blocked path or the image of the item. This noticeable difference between the adaptation rules is presumably the reason for the striking difference in the preference ratings. Therefore, in the second study, we investigated the role of adding vs. removing information in the course of interface adaptation. The second study tested the hypothesis that the poorly ranked adaptation rules will be ranked higher when information is not only removed but the removal is explained beforehand by adding information.

User Study 2
The goal of user study 2 was to evaluate whether the comparably poor performance of the rules Fragile Object, Experienced Worker and Noisy Environment improved when information (called user support in [2]) was added prior to showing the adaptation in the UI. User support means explaining an adaptation before it occurs or giving hints of an approaching adaptation. The design of the study was the same as in user study 1. However, in user study 1 we used a paper-based map to simulate the warehouse layout, whereas in user study 2 we simulated the warehouse environment on the floor of a large meeting room, using sheets of paper as shelves and real items on them representing the items to be picked (see Fig. 6). Consequently, users were truly able to move around and pick the items, which made the setting more realistic. The conditions for the adaptation rules were also implemented in a more realistic way, e.g. by putting obstacles in the way for the Traffic Jam rule or by using real fragile objects (glasses) for the Fragile Object rule.
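To make the relation between the context variations of Table 1 and the idea of user support more concrete, the following is a minimal sketch and not the implementation of the prototype. All names (Context, UiState, adapt, SUPPORT_HINTS) and the hint texts are hypothetical, and the timing condition of the fragile rule (after the picker has confirmed the arrival) is left out for brevity. The sketch only illustrates that a rule which removes information can first add an explanation.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Context(Enum):
    """The five context variations of Table 1 (names are hypothetical)."""
    FRAGILE_ITEMS = auto()
    ROUTE_BLOCKED = auto()
    EXPERIENCED_PICKER = auto()
    NOISY_ENVIRONMENT = auto()
    PICK_TIMEOUT = auto()


@dataclass
class UiState:
    """Interaction modalities of the basic interaction flow."""
    visual_output: bool = True
    vocal_output: bool = True
    vocal_input: bool = True
    map_view: bool = True
    messages: list = field(default_factory=list)


# Hypothetical explanation texts for the rules that remove information.
SUPPORT_HINTS = {
    Context.FRAGILE_ITEMS: "Fragile items: the display will switch off.",
    Context.EXPERIENCED_PICKER: "Experienced picker: the Map view is skipped.",
    Context.NOISY_ENVIRONMENT: "Noisy environment: voice interaction is off.",
}


def adapt(state: UiState, context: Context, user_support: bool = False) -> UiState:
    """Apply one adaptation rule; optionally explain a removal beforehand."""
    if user_support and context in SUPPORT_HINTS:
        # User support: add information before information is removed.
        state.messages.append(SUPPORT_HINTS[context])

    if context is Context.FRAGILE_ITEMS:
        state.visual_output = False   # only the vocal channel remains
    elif context is Context.ROUTE_BLOCKED:
        state.messages.append("Blocked path marked, alternative route suggested.")
    elif context is Context.EXPERIENCED_PICKER:
        state.map_view = False
    elif context is Context.NOISY_ENVIRONMENT:
        state.vocal_input = False     # only the visual channel remains
        state.vocal_output = False
    elif context is Context.PICK_TIMEOUT:
        state.messages.append("Image of the item shown, vocal output repeated.")
    return state


if __name__ == "__main__":
    state = adapt(UiState(), Context.FRAGILE_ITEMS, user_support=True)
    print(state.visual_output, state.messages)
```

In this sketch, calling adapt without user_support corresponds to the rules as evaluated in user study 1, while user_support=True corresponds to the variant with a preceding explanation.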
Because the setting was more realistic than in study 1, a secondary research question was whether the more realistic setting affects the evaluation results. Since two of the rules (Traffic Jam and Pick Timeout) were not changed, unchanged evaluation scores for these two rules would indicate that the more realistic setting of the second study does not affect the reliability of the evaluation.

Participants again were company staff or students of the local university (who had not participated in the first study). A total of 10 participants took part in the study; 9 were male and 1 was female. The average age of the participants was 29 years (SD = 4.44).

Fig. 6. Evaluation environment of User Study 2

Since most of the scales of the questionnaire were not normally distributed, we applied non-parametric tests for the data analysis. We calculated the Friedman test for every single questionnaire scale and for the aggregated overall rating from all nine scales (Bonferroni-corrected) to assess differences between the five adaptation rules. In case of significance, we calculated a post-hoc Wilcoxon signed-rank test for each pair of adaptation rules (Bonferroni-corrected as well).

The Friedman test revealed significant differences for the aggregated overall rating over all 9 scales (χ²(4) = 17.99, p = .001) and for three of the subscales: Error-Prevention (χ²(4) = 17.76, p = .001), Intuitiveness (χ²(4) = 17.19, p = .002) and User Experience (χ²(4) = 15.96, p = .003). The scales with significant differences between the rules are displayed in Fig. 7. Although the Friedman test revealed significant differences between the rules for all these scales, the pairwise comparisons failed to reach significance due to the Bonferroni correction. Looking at the graphs, there are three main observations:

• The Fragile rule improved significantly compared to the first study.
• The Experienced Worker rule performs consistently worse than the other rules (although the pairwise comparisons did not reach significance).
• The ratings of the four other rules (Experienced Worker, Traffic Jam, Pick Timeout and Noisy) did not change in the course of the second experiment.

Fig. 7. Study 2: Overall rating and the subscales User Experience, Error-Prevention and Intuitiveness

In order to test these observations for significance, we conducted a Kruskal-Wallis test comparing the results of the first and the second study. The test reveals that the overall rating of the Fragile rule increased significantly (H(1) = 12.17, p < .001), which can be attributed to the scales Appropriateness (H(1) = 9.44, p = .002), Performance (H(1) = 11.14, p = .001), Error-Prevention (H(1) = 11.44, p = .001), User Experience (H(1) = 7.15, p = .008), Intuitiveness (H(1) = 12.75, p < .001) and General Likeability (H(1) = 8.07, p = .005). Thus, for the Fragile rule, all scales except Continuity and Comprehensibility increased significantly. All other comparisons were not significant. Thus, none of the other rules was rated better or worse on any scale compared to study 1.

DISCUSSION
User study 2 addressed the research question: does the addition of information prior to the removal of information in the course of an adaptation of the interface improve the perceived quality of the adaptation rule? The results of the study partly support this hypothesis. While the Fragile rule improved significantly on almost all scales, the Experienced Worker and Noisy rules did not improve.

The improvement of the Fragile rule can most probably be attributed to what Paymans et al. [2] call user support. According to the authors, users experience difficulty in building adequate models of adaptive systems; user support is therefore expected to help users understand and learn the adaptive rules. For the Fragile rule, the performance improved significantly with the help of user support.
Before the display of the HMD was shut down, the users were notified by a short alert video to be cautious when picking fragile objects, so that the rationale of the rule (preventing visual distraction of the user) could be understood more easily.

However, for the Experienced Worker and Noisy rules, the ratings were not improved by adding explanatory information as user support. We can think of two possible reasons for this finding. First, autonomous interface adaptations can easily reduce the usability of a system. Loss of control might be an issue in both rules. For example, in the Noisy rule, users cannot confirm their location or the amount by voice; instead, the system sets a timeout for automatic confirmation. Setting the timeout either too long or too short will put the user in an uncomfortable situation (i.e. waiting for or missing the following system information). In the Experienced Worker rule, the user might want to decide for himself whether he gets to see the map, even though he might not really need it. In both cases, the loss of control over the system might be a problem. To overcome the problem of controllability, we can enrich the user profile and the context information to provide even more precise and personalized adaptations. Furthermore, we can consider increasing the flexibility of operation, so that users have more possibilities to intervene in the adaptation. Second, even in user study 2, the setting of the user study was still simulated. A real testing environment with real users (i.e. real pickers) might result in different ratings. Although the change in fidelity between the two studies presented here did not affect the ratings (see below), a real environment with real users might yield more valid results (e.g. imagining being an experienced user might not result in the same rating as actually being an experienced user).

Furthermore, the change in the evaluative setting did not affect the rating of the rules. This is an interesting finding with regard to evaluation methodologies. Although the study design was much more realistic in the second study, the ratings of the unchanged rules Traffic Jam and Pick Timeout did not differ significantly between the two studies. Thus we conclude that a low-fidelity evaluation setting (e.g. imagining the movement through a warehouse vs. actually moving through a simulated warehouse) does not affect the ratings obtained when evaluating adaptive features of an interface. Our studies suggest that the rating of adaptive rules has no direct and obvious relation to the fidelity of the evaluation environment.

CONCLUSIONS
In the process of AUI development, adaptive rules must be carefully designed and evaluated to avoid usability and user experience pitfalls. Applying UCE in different phases of the development helps to detect the flaws of adaptive features in time. On the basis of the results of two user studies, some common drawbacks of adaptive systems were detected and eliminated in our application. Remedies or potential improvements for some of these drawbacks are proposed in this paper. As a main result, we found that adding user support information can help users to comprehend and accept adaptation rules. Furthermore, we argue that enriching the context and the users’ profile can increase the precision of the adaptation. Also, enabling the user to intervene in the adaptation at any time will improve the user experience by improving controllability. We are convinced that the iterative evaluation of adaptive systems is crucial to the successful development of AUIs. Regarding the iterative testing of such systems, we found that the fidelity of the testing environment had no detectable effect on the users’ rating of the adaptation rules. Thus, rapid iterative testing of adaptation rules does not need to be an expensive enterprise and is therefore highly recommended.

REFERENCES
1. Bongartz, S., Jin, Y., Paterno, F., Rett, J., Santoro, C., Spano, L.D. Adaptive User Interfaces for Smart Environments with the Support of Model-based Languages. To be published in Proc. AmI 2012.
2. Paymans, T.E., Lindenberg, J., Neerincx, M. Usability trade-offs for adaptive user interfaces: ease of use and learnability. In Proc. IUI '04. ACM Press (2004), 301-303.
3. van Velsen, L., van der Geest, T., Klaassen, R., Steehouder, M. User-centered evaluation of adaptive and adaptable systems: A literature review. Knowl. Eng. Rev. 23, 3 (2008), 261-281.