=Paper=
{{Paper
|id=Vol-2245/mdetools_paper_4
|storemode=property
|title=A Study Design Template for Identifying Usability Issues in Graphical Modeling Tools
|pdfUrl=https://ceur-ws.org/Vol-2245/mdetools_paper_4.pdf
|volume=Vol-2245
|authors=Jakob Pietron,Alexander Raschke,Michael Stegmaier,Matthias Tichy,Enrico Rukzio
|dblpUrl=https://dblp.org/rec/conf/models/PietronRSTR18
}}
==A Study Design Template for Identifying Usability Issues in Graphical Modeling Tools==
Jakob Pietron¹, Alexander Raschke¹, Michael Stegmaier¹, Matthias Tichy¹, and Enrico Rukzio²

¹ Institute of Software Engineering and Programming Languages
² Institute of Media Informatics
Ulm University, 89081 Ulm, Germany
{jakob.pietron, alexander.raschke, michael-1.stegmaier, matthias.tichy, enrico.rukzio}@uni-ulm.de

Abstract. Model-driven engineering aims at increasing the productivity of software engineering and the quality of the software. These positive effects have been confirmed in several empirical studies. However, those studies also report that the usability of model-driven engineering tools is generally considered to be poor. This is also the prevalent opinion on usability expressed in discussions in academia as well as in our collaborations with practitioners. Unfortunately, there are scarcely any empirical studies on identifying usability issues, nor papers reporting on systematically evaluated usability improvements in model-driven engineering. In this paper, we present a study design template for identifying usability issues specifically in graphical editors. The template is grounded in usability research as well as in empirical research methods. We illustrate the proposed study design with the example of identifying usability issues in a graphical editor for developing state machines.

Keywords: Usability · Graphical Modeling Tools · Model-driven Engineering · Study Design

===1 Motivation & Problem Statement===

The abstraction introduced by model-driven engineering (MDE) aims at building complex systems with high quality in a shorter time. Unfortunately, MDE is still not widely used in industry due to several drawbacks [8]. Condori-Fernández et al. [10] even conclude that most of these drawbacks share the same root cause: a lack of usability in MDE tools. It seems that MDE tool developers focus on providing frameworks that make it easy to develop domain-specific languages, while end-user usability is neglected. A common argument for the low priority of usability is that these tools are only used by experts. This argument does not hold, because better usability can also improve the efficiency of experts. Another problem is that common usability guidelines are not easy to apply, since MDE tools are more than just a means of drawing models; MDE tools should support an analyst in developing a system [21]. Therefore, our research focus is on analyzing and improving the usability of MDE tools.

Before the usability of MDE tools can be improved, the actual usability issues must be identified. In this paper, we propose a study design template to systematically evaluate the usability of graphical modeling tools. The aim of this study design template is to identify deficits in usability by generating qualitative rather than quantitative data. While quantitative data, as collected by, e.g., Condori-Fernández et al. [10], can indicate problems, make them measurable, and enable the comparison of solutions, it does not identify the actual problems. Since we are interested in the real causes of poor usability, we propose a qualitative study design. In our study design template we focus on the creation and modification of graphical representations of models. All graphical modeling tools have this functionality in common, whereas other functionality such as debugging or model checking is editor- and language-specific.
We suggest performing the evaluation in terms of efficiency, effectiveness, and satisfaction, the main aspects of usability as proposed by ISO 9241-11 [13]. Our study design template is meant to form the basis for a family of experiments as advocated by Basili [6] and as such should enable a simple replication of collected data. We suggest a think-aloud observation, after which participants are interviewed and asked to fill in questionnaires. The questionnaires serve to collect experience level and demographic data, and contain the questions for calculating the System Usability Scale (SUS) score [9] and the Technology Affinity (TA-EG) score [15]. To enable a qualitative analysis of the collected data, coding techniques are used to annotate captured screen recordings.

After discussing the related work in Section 2, we describe our study design template in Section 3 and finally conclude this paper in Section 4.

===2 Related Work===

There is a lot of literature discussing good practices in conducting usability studies in general (see, e.g., [17] or the literature survey in [12]). In all this scientific work, different methods for measuring usability or identifying usability problems are introduced and discussed. However, the composition of these methods into a result-oriented study design is often left to the reader. As Basili [6] argues, guidelines or frameworks are indispensable, especially for the repeatability of experiments. In [1], for example, such a framework for measuring the usability of websites is proposed. Similarly, [2] introduces a more general framework, but one based on very low-level interactions and therefore difficult to apply in a given context. Moreover, both works aim at usability measurement, not at identifying concrete usability problems.

In the field of modeling tools, which in this context also includes UML tools and domain-specific languages (DSLs), the following three areas can be distinguished.

The first area includes work that focuses on the usability of modeling languages themselves (for an overview, see [20]), but not on the usability of the tools used. The study conducted by Cuenca et al. [11] uses methods similar to the ones we propose (SUS, observations, and questionnaires, see below), but only measures the usability of a (textual) DSL and not the usability of the tools used. The same holds for [5], where the authors present "a way to systematize the evaluation process" of DSLs. They introduce a generic process for evaluating a DSL as the user interface of a developer using that DSL. Poltronieri et al. go one step further and define a more precise framework, Usa-DSL [19], for this activity in order to carry out replicated usability studies of DSLs. Both papers evaluate the DSL itself, but not the (graphical) tools for working with (graphical) DSLs, as we do.

In the second area, the usability of tools is considered, but with a focus on tool selection. For example, the work of Safdar et al. [24] compares the usability of different tools across several diagram types. Rouly et al. [23] describe a method for understanding the usability of visual integrated development environments (IDEs) used by non-programmers. The considered tools are mostly used for interactive editing of graphical models, yet not necessarily models in the sense of MDE. Usability is measured by analyzing the tools' interface characteristics according to a proposed model. The main goal of this work is the comparison of the tools' usability in an abstract way.
The third area, which we also address, considers the usability of modeling tools per se. In [7], the authors conduct an experiment to measure the usability of six UML modeling tools. They captured the time each of the 58 participants needed to fulfill several tasks, under the assumption that "the time of performing tasks is one of the usability indicators". The results were confirmed by an analysis with GOMS [14]. GOMS is a formal usability method that tries to measure the usability of an interface by decomposing typical user tasks into simple basic actions. For each task, the required time is calculated from estimates per basic user action and then assessed. Besides this simple time comparison, the authors collected participants' comments and suggestions, but not systematically. Interestingly, most of the reported problems apparently have not been addressed in the last 13 years.

In [10], Condori-Fernández et al. introduce an evaluation model for assessing the usability of MDE tools by measuring the completeness of user-performed tasks in relation to the number of steps needed to finish them. Their framework is defined in a very abstract way by an evaluation model and a generic process. When applying the framework to a tool, the authors propose exemplary methods to be used. Concrete usability problems are discovered by classifying observations against ergonomic criteria. In contrast, our proposed qualitative study design template, which is explained in detail in the following sections, allows usability problems to be discovered on a fine-grained level.

===3 Study Design Template===

Our study design template can be seen as a generic guideline for how to conduct a case study to identify fundamental usability-related problems of graphical modeling tools. The study design template focuses on usability problems that occur while users are creating and editing a graphical representation of a model with the functionality provided by a specific modeling tool, and not on usability problems that are related to a particular DSL.

Yin defines a case study as "an empirical inquiry that investigates a contemporary phenomenon within its real-life context" [26]. Furthermore, we can define the type of case study more precisely: it should be an exploratory and explanatory single-case study. It is exploratory because it identifies the actual usability problems in graphical modeling tools in general, and the parts, features, and components of an editor that are responsible for these usability problems in particular. By investigating the reasons why these problems occur, the case study additionally becomes explanatory.

As required by Yin [26], in this section we describe the requirements for the objects of study (editor and graphical modeling language), how tasks should be designed, how to find the right participants, and how to analyze the collected data. For each aspect we discuss the requirements, introduce the solution suggested by our study design template, and illustrate it with an example. The example is based on a study we conducted to evaluate the usability of the graphical modeling tool Yakindu Statechart Tools.

====3.1 Objects of Study====

The choice of the right editor for a problem discovery study depends on the graphical language used. The chosen graphical language should offer the possibility to define real-world tasks with different levels of complexity. Furthermore, the chosen language should be well known to participants so that it has no negative effect on the internal validity.
If no specific domain-specific language (DSL) is predefined, we suggest state machines as the graphical language for our study design template; state machines are well known to students as well as to developers. Therefore, participants can be recruited from a university as well as from an industrial context, and they do not have to learn a new graphical language. State machines support a wide range of complexity: it is possible to create very simple state machines but also very complex ones. This makes it easy to create different kinds of tasks for the study. Furthermore, one can choose from a wide range of graphical modeling editors that support state machines.

On the other hand, if the editor to be evaluated supports exactly one graphical modeling language and no real users are available to participate in the study, the participants, e.g., students, require training for that DSL. Regardless of whether real users or test users participate, it is important for the later analysis to distinguish between problems related to the tool and problems related to the language used.

Example. With the Eclipse Modeling Framework (EMF), the Graphical Editing Framework (GEF), and the Graphical Modeling Framework (GMF), the Eclipse community provides a rich toolbox for creating graphical model-driven DSLs and corresponding graphical tools. Many graphical tools such as Papyrus (https://www.eclipse.org/papyrus/), Graphiti (https://www.eclipse.org/graphiti/), and Yakindu Statechart Tools (https://www.itemis.com/en/yakindu/state-machine/) are based on these frameworks. We choose Yakindu, an editor based on GEF, as the tool to be evaluated. We prefer Yakindu for two reasons: first, it supports state machines (see above); second, it is the tool which focuses the most on graphical editing. It hides a lot of complexity by allowing the user to manipulate and create elements directly in the graphical view. In contrast to, e.g., Papyrus, no dialog windows must be filled in; most interactions take place inside the graphical view. This in turn supports the internal validity of a study.

====3.2 Tasks====

As mentioned before, our study design template focuses on the interaction with the graphical representation of a model, which is a diagram in most cases. We suggest defining tasks that let participants recreate a given graphically modeled system, e.g., handed out as a printed task description, or edit and refactor a prepared diagram using the tool to be evaluated. A participant's result does not have to look exactly the same as in the task description, but it must be functionally equivalent. Setting the task to recreate the diagram shown on a screenshot ensures that all observed problems are related to the editor and are not influenced by problems in understanding the task or a textual system description. Overall, this supports the internal validity.

The tasks should cover a wide range of scenarios and functionality of the chosen graphical DSL and modeling tool. Furthermore, the tasks should become more and more complex in order to support the users' learning curve [17]. On the one hand, the complexity of a task depends on the number of objects, connections, and language concepts used; on the other hand, the more objects need to be managed, the more difficult it becomes to keep the modeled diagram readable and understandable. Therefore, the layout of the diagram created by a participant should be readable and comprehensible. We provide a set of rules with the intention of making the participants change a default-layouted diagram to improve its readability and comprehensibility. Some of our rules are adopted from the work of Purchase [22]:

– Objects do not overlap
– Edges do not overlap and have a distinguishable margin in between [22]
– Prevent crossing edges [22]
– A label has a clear relation to its edge
– Prevent bends in an edge's path [22]
– All labels, names, titles, and descriptions are displayed completely, without abridging points

The defined rules are a minimum set of rules that should be fulfilled by a participant's diagram to complete a task. This forces participants to adapt the layout of their diagram using the facilities provided by the evaluated editor. The rules can be extended depending on the chosen graphical modeling language and evaluated editor; a simple check of the first rule is sketched below.
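To illustrate how such layout rules can be checked against a participant's final diagram, the following minimal sketch tests the first rule (objects do not overlap) on axis-aligned bounding boxes. The `Node` type, the function names, and the example coordinates are hypothetical illustrations and are not part of any particular tool's API.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass
class Node:
    """Hypothetical axis-aligned bounding box of a diagram object."""
    name: str
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float

def overlaps(a: Node, b: Node) -> bool:
    """Return True if the bounding boxes of a and b intersect."""
    return (a.x < b.x + b.width and b.x < a.x + a.width and
            a.y < b.y + b.height and b.y < a.y + a.height)

def check_no_overlap(nodes: list[Node]) -> list[tuple[str, str]]:
    """Rule 1: objects do not overlap. Returns all violating pairs."""
    return [(a.name, b.name)
            for a, b in combinations(nodes, 2) if overlaps(a, b)]

# Hypothetical example diagram: state B overlaps state C.
diagram = [Node("A", 0, 0, 80, 40),
           Node("B", 120, 0, 80, 40),
           Node("C", 150, 20, 80, 40)]
print(check_no_overlap(diagram))   # [('B', 'C')]
```

The other rules (edge crossings, label placement, truncated labels) could be checked analogously or, as in our study, simply verified visually by the experimenter when a participant declares a task finished.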
Example. A study can consist of different state machine tasks: state machines should be created from scratch, but existing ones should also be edited and refactored. The tasks should cover a wide range of state machine functionality, such as hierarchy, nested states, parallel states, and transitions with guards, at different levels of complexity. Simple state machines consist of only a few connected states without any hierarchy or parallelism. Complex state machines can consist of more than 20 states with up to 50 transitions, hierarchy, and parallelism. All participants receive a prepared Yakindu workspace (following the example from the previous subsection) and, for each task, a printed screenshot of a state machine. This state machine has to be modeled beforehand by the researchers with the tool whose usability is to be evaluated. All required events, variables, and, if necessary, incomplete state machines are prepared a priori in the workspace.

====3.3 Participants====

Ideally, real users of the tool participate in a study. In most cases, however, this will not be possible. Therefore, in this subsection we describe the requirements that the participants must fulfill in order to match the characteristics of the real users as closely as possible.

We assume that users of graphical modeling tools are expert users. Expert users (in contrast to casual users) use a tool regularly over a long period of time. In order to achieve sufficient external validity, the recruited participants should be at least well experienced in working with computers and software in general, and with graphical modeling tools in particular. The graphical modeling language must also be well known; if it is not, participants should complete a training beforehand. It is not required that participants have prior experience with the specific editor to be evaluated, but they should have worked with some graphical modeling tool for at least six months to increase internal validity. To check whether participants fulfill all requirements, the TA-EG questionnaire [15] should be used. Additionally, participants should be explicitly asked in a questionnaire about their experience with graphical modeling tools (see the following Subsection 3.4).

The TA-EG questionnaire (original German title: Technikaffinität – Elektronische Geräte, translated: Technology Affinity – Electronic Devices) can be used to measure affinity to technical devices like computers, mobile phones, or navigation systems [15]. TA-EG consists of 19 Likert-scale questions grouped into the four categories enthusiasm, competence, negative attitudes, and positive attitudes. By using TA-EG, we assess the characteristics of our participants with regard to each category. In this way, we can ensure that the participants have a positive attitude towards technology, which we assume corresponds to that of expert users in the context of modeling tools.

Nielsen and Landauer [18] describe the number of participants required for problem discovery studies as depending on two factors: first, the percentage of all usability problems that should at least be found; second, the probability of a single problem being found by a single participant. Across eleven usability studies, Nielsen and Landauer found that the probability of a problem being found ranges from 0.16 to 0.60, with an average of 0.31. Following the authors' formulation, to detect 90 % of all problems we require at least nine participants with a problem detection probability of 0.31 (average case). To discover 90 % of all problems with a problem detection probability of only 0.16 (worst case), we require at least 15 participants. Additionally, we require at least one informal participant for a pilot test to fix possible errors in the concrete study design.
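For reference, the model published by Nielsen and Landauer [18] estimates the number of distinct problems found with n participants as

\[ \mathrm{Found}(n) = N\bigl(1 - (1 - p)^{n}\bigr), \]

where N is the total number of usability problems in the interface and p is the probability that a single participant detects a given problem. Requiring that the expected proportion of found problems reaches a target q (here 90 %) yields a lower bound on the number of participants:

\[ \frac{\mathrm{Found}(n)}{N} \ge q \iff n \ge \frac{\ln(1-q)}{\ln(1-p)}. \]

The participant numbers given above are conservative choices that satisfy this bound for the respective values of p.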
====3.4 Data Sources====

Our study design template benefits from multiple measures that generate qualitative as well as quantitative data: a think-aloud observation, a questionnaire, and a semi-structured interview. The three methods should be conducted one after the other, in the order listed. Using several data sources limits the effects of a possible misinterpretation of a single data source; triangulation can be used to increase the validity of observed problems [26].

During the think-aloud observation, the participants have to use the system while continuously thinking out loud and being observed by an experimenter. Thinking out loud means that participants have to verbalize their thoughts and describe what they actually expect and what happens instead. During an observation session, audio and video are recorded. It should be noted that users might give false impressions or their own theories about the cause of usability problems. The experimenter should focus on what users actually do instead of what users say they do. For example, a participant might criticize a missing button even though the participant simply has not seen it; the real problem is therefore the visibility of the button and not its absence. Later analysis is needed to abstract the observed problems from the individual participants and identify the underlying usability problems. On the other hand, users' comments on user interface elements, what they like and do not like, can be useful input for later improvements [17].

Directly after finishing the think-aloud observation, participants are asked to fill in a questionnaire. The questionnaire consists of three parts: the System Usability Scale (SUS) for measuring the usability of the evaluated tool with a quantitative score, the TA-EG questionnaire, and questions about previous experience and demographic data.

SUS [9] is a well-established instrument for measuring usability on a quantitative scale. SUS is used to get a first impression of the usability of the evaluated tool. Furthermore, this score can be compared with possible future improvements or with other evaluated tools. Our study design contains SUS because it consists of only ten Likert-scale questions; after a possibly long think-aloud observation, participants do not want to fill in an extensive usability questionnaire. Although SUS is a short questionnaire, the resulting score, a value between 0 and 100, is meaningful and valid [4]. The numerical SUS score is not a percentage, despite its appearance. To interpret the numerical score, we use the adjective and grade rating scales developed by Bangor et al. [3], whose work is based on an analysis of 1,000 SUS surveys. The average score of the examined SUS surveys is 68. This numerical SUS score corresponds, for example, to "ok" on the adjective rating scale and to a "D" on the grade scale.
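As a concrete illustration, the following sketch computes the SUS score from the ten Likert items using the standard scoring rule described by Brooke [9] (odd-numbered, positively worded items contribute their scale position minus 1; even-numbered, negatively worded items contribute 5 minus their scale position; the sum is multiplied by 2.5). The example responses are invented.

```python
def sus_score(responses):
    """Compute the System Usability Scale score (0-100) from ten Likert
    responses, each an integer from 1 (strongly disagree) to 5 (strongly
    agree), given in the original item order."""
    if len(responses) != 10 or not all(1 <= r <= 5 for r in responses):
        raise ValueError("SUS expects ten responses in the range 1-5")
    total = 0
    for i, r in enumerate(responses, start=1):
        # Odd items are positively worded: contribution r - 1.
        # Even items are negatively worded: contribution 5 - r.
        total += (r - 1) if i % 2 == 1 else (5 - r)
    return total * 2.5  # scale the 0-40 sum to 0-100

# Hypothetical responses of one participant:
print(sus_score([4, 2, 4, 1, 5, 2, 4, 2, 3, 3]))  # 75.0
```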
As introduced in Subsection 3.3, the TA-EG questionnaire and the questions about prior experience are used to validate the sample against the previously defined requirements. Besides asking for prior experience in general, we recommend explicitly asking which graphical modeling tools participants are experienced with and for how long they have already used them. This data helps to identify possible biases of the participants.

Finally, a semi-structured interview should be used to explore what the users have in mind when working with the editor and to gain an understanding of interesting observations made during the think-aloud phase [16]. Each participant has to answer pre-determined questions about the overall experience and about what was good and what was bad while working with the editor. Aside from the pre-determined questions, some of the interview questions should be based on the observations made during the think-aloud phase in order to get a deeper understanding of the users' behavior in specific situations, e.g., error situations.

Example. The researcher noted that a participant had problems clicking on a specific node. In the course of the semi-structured interview, the researcher may ask the participant to explain that situation in her or his own words, to describe the actual goal, and to describe what happened instead. The researcher could also ask for possible improvements.

====3.5 Qualitative Analysis====

Our study design mainly generates qualitative data. Coding is our proposed way to analyze and structure this data. A code is a short phrase or sentence that is assigned to a text passage of the transcribed interviews or to a video snippet recorded during the think-aloud observation. The code should capture the essence or meaning of the coded data [25]. As defined initially, our study design addresses the discovery of problems and the context in which these problems occur. We suggest the open coding technique to identify problems in the collected data. Open coding breaks down the data into first, provisional, comparative codes [25]. Afterwards, we suggest building up a category system with a focus on the affected elements and the actions performed by users. The category system should emerge from the coded data.

Example. A problem is observed when a participant tries to click on a small connection with the intention of changing its position. Instead of clicking on the connection, the participant clicks on the underlying box and moves the box instead. One way to code the described problem is connection (category: context) and click hit (category: performed action). A simple way to keep track of such codes is sketched below.
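To illustrate how such codes could be managed during analysis, here is a minimal sketch of a tabular bookkeeping of coded observations; the data structure, the field names, and the example entries are hypothetical and not prescribed by the study design template.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class CodedObservation:
    """One coded snippet from a screen recording or interview transcript."""
    participant: str
    timestamp: str   # position in the recording, e.g. "00:12:34"
    context: str     # affected element (category "context")
    action: str      # performed action (category "performed action")
    note: str        # free-text essence of the observation

# Hypothetical observations from two participants:
observations = [
    CodedObservation("P01", "00:12:34", "connection", "click hit",
                     "tried to select a short transition, moved the state instead"),
    CodedObservation("P03", "00:04:10", "connection", "click hit",
                     "clicked next to the edge, selection missed"),
    CodedObservation("P03", "00:21:02", "label", "rename",
                     "double-click on label did not open inline editing"),
]

# Tally how often each (context, action) pair occurs across participants,
# which hints at which usability problems are observed most frequently.
frequencies = Counter((o.context, o.action) for o in observations)
for (context, action), count in frequencies.most_common():
    print(f"{context} / {action}: {count}")
```

Such a tally is only an aid for structuring the qualitative material; the categories themselves should still emerge from the coded data, as described above.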
===4 Conclusion===

Although MDE has a vital research community, its ideas and results are still not well established in the field. One reason might be the poor usability of the provided tools and/or of the provided frameworks to build tools. While there are several studies that quantitatively measure the usability of MDE tools, we try to discover concrete problems in order to work on their improvement. In this paper, we introduce a study design template that allows qualitative usability studies to be conducted more easily. We describe the different aspects of such a study in detail (objects, tasks, participants, data sources, and qualitative analysis) and discuss our suggested methods. This rather abstract description is supplemented by a running example in which the usability of the GEF-based tool Yakindu was examined during the creation and modification of statecharts. We have conducted this study and are currently working on fully evaluating the results.

===References===

1. Al-Wabil, A., Al-Khalifa, H.: A framework for integrating usability evaluations methods: The Mawhiba web portal case study. In: 2009 International Conference on the Current Trends in Information Technology (CTIT), pp. 1–6 (Dec 2009)
2. Andre, T.S., Hartson, H.R., Belz, S.M., McCreary, F.A.: The user action framework: a reliable foundation for usability engineering support tools. International Journal of Human-Computer Studies 54(1), 107–136 (2001)
3. Bangor, A., Kortum, P., Miller, J.: Determining What Individual SUS Scores Mean: Adding an Adjective Rating Scale. 4(3), 114–123 (2009)
4. Bangor, A., Kortum, P.T., Miller, J.T.: An Empirical Evaluation of the System Usability Scale. 24(6), 574–594 (2008)
5. Barišic, A., Amaral, V., Goulão, M., Barroca, B.: Evaluating the Usability of Domain-Specific Languages. In: Software Design and Development: Concepts, Methodologies, Tools, and Applications, pp. 2120–2141. IGI Global, Hershey, PA, USA (2014)
6. Basili, V.R., Shull, F., Lanubile, F.: Building knowledge through families of experiments. IEEE Transactions on Software Engineering 25(4), 456–473 (Jul 1999)
7. Bobkowska, A., Reszke, K.: Usability of UML Modeling Tools. In: Proceedings of the 2005 Conference on Software Engineering: Evolution and Emerging Technologies, pp. 75–86. IOS Press, Amsterdam, The Netherlands (2005)
8. Bordeleau, F., Liebel, G., Raschke, A., Stieglbauer, G., Tichy, M.: Challenges and research directions for successfully applying MBE tools in practice. In: Proceedings of MODELS 2017 Satellite Event: Workshops (ModComp, ME, EXE, COMMitMDE, MRT, MULTI, GEMOC, MoDeVVa, MDETools, FlexMDE, MDEbug), Posters, Doctoral Symposium, Educator Symposium, ACM Student Research Competition, and Tools and Demonstrations. CEUR Workshop Proceedings, vol. 2019, pp. 338–343. CEUR-WS.org (2017)
9. Brooke, J.: SUS: A "quick and dirty" usability scale. In: Usability Evaluation in Industry. Taylor and Francis (1986)
10. Condori-Fernández, N., Panach, J.I., Baars, A.I., Vos, T., Pastor, Ó.: An empirical approach for evaluating the usability of model-driven tools. Science of Computer Programming 78(11), 2245–2258 (2013)
11. Cuenca, F., Bergh, J.V.d., Luyten, K., Coninx, K.: A User Study for Comparing the Programming Efficiency of Modifying Executable Multimodal Interaction Descriptions: A Domain-specific Language Versus Equivalent Event-callback Code. In: Proceedings of the 6th Workshop on Evaluation and Usability of Programming Languages and Tools, pp. 31–38. PLATEAU 2015, ACM, New York, NY, USA (2015)
12. Hornbæk, K.: Current practice in measuring usability: Challenges to usability studies and research. International Journal of Human-Computer Studies 64(2), 79–102 (2006)
13. ISO: ISO 9241-11:1998 Ergonomic requirements for office work with visual display terminals (VDTs) – Part 11: Guidance on usability. Tech. rep., International Organization for Standardization (1998)
14. John, B.E., Kieras, D.E.: The GOMS Family of User Interface Analysis Techniques: Comparison and Contrast. ACM Trans. Comput.-Hum. Interact. 3(4), 320–351 (Dec 1996)
15. Karrer, K., Glaser, C., Clemens, C., Bruder, C.: Technikaffinität erfassen – der Fragebogen TA-EG. In: Der Mensch im Mittelpunkt technischer Systeme, no. 8, pp. 196–201 (2009)
16. Lazar, J., Feng, J.H., Hochheiser, H.: Research Methods in Human-Computer Interaction. Wiley (2010)
17. Nielsen, J.: Usability Engineering. Academic Press (1993)
18. Nielsen, J., Landauer, T.K.: A mathematical model of the finding of usability problems. pp. 206–213. ACM Press (1993)
19. Poltronieri, I., Zorzo, A.F., Bernardino, M., de Borba Campos, M.: Usa-DSL: Usability evaluation framework for domain-specific languages. In: Proceedings of the 33rd Annual ACM Symposium on Applied Computing, pp. 2013–2021. SAC '18, ACM, New York, NY, USA (2018)
20. Poltronieri Rodrigues, I., Campos, M.d.B., Zorzo, A.F.: Usability Evaluation of Domain-Specific Languages: A Systematic Literature Review. In: Human-Computer Interaction. User Interface Design, Development and Multimodality, LNCS, vol. 10271, pp. 522–534. Springer (2017)
21. Post, G., Kagan, A.: User requirements for OO CASE tools. Information and Software Technology 43(8), 509–517 (2001)
22. Purchase, H.: Which aesthetic has the greatest effect on human understanding? In: Graph Drawing, pp. 248–261. Springer (1997)
23. Rouly, J.M., Orbeck, J.D., Syriani, E.: Usability and Suitability Survey of Features in Visual IDEs for Non-Programmers. In: Proceedings of the 5th Workshop on Evaluation and Usability of Programming Languages and Tools, pp. 31–42. PLATEAU '14, ACM, New York, NY, USA (2014)
24. Safdar, S.A., Iqbal, M.Z., Khan, M.U.: Empirical Evaluation of UML Modeling Tools – A Controlled Experiment. In: Modelling Foundations and Applications, Lecture Notes in Computer Science, vol. 9153, pp. 33–44. Springer, Cham (2015)
25. Saldaña, J.: The Coding Manual for Qualitative Researchers. SAGE, 3rd edn. (2016)
26. Yin, R.K.: Case Study Research: Design and Methods. SAGE, 5th edn. (2014)