Comparing Comprehensibility of Modelling Languages for Specifying Behavioural Requirements

Grischa Liebel
Software Engineering Division
Chalmers | University of Gothenburg, Sweden
grischa@chalmers.se

Matthias Tichy
Ulm University, Germany
matthias.tichy@uni-ulm.de

Abstract: The selection of a suitable modelling language influences the success of software modelling. Several experiments comparing the comprehensibility of graphical modelling languages have been published. However, no published study comparing the comprehensibility of functional requirements modelled in different graphical modelling languages exists. This paper evaluates how two requirements modelled in a sequence-based notation, Modal Sequence Diagrams, and in a state-based notation, Timed Automata, compare with respect to comprehensibility. A controlled experiment with 22 students from an undergraduate course on software modelling was performed. Our results show no significant differences with respect to the comprehensibility of the two different languages, but subjects who answered the questionnaire for the sequence-based notation completed significantly more answers in the given time limit. These initial results indicate that choosing a modelling language for requirements modelling based on convenience does not significantly affect the understanding of the resulting requirements.

I. INTRODUCTION

Choosing a visual modelling language in practice is typically dependent on previous experience with the modelling language or the availability of the respective modelling tools. While both aspects are certainly reasonable, other criteria are similarly important.

Comprehensibility is a commonly evaluated criterion of visual languages, e.g., for UML diagrams in [1], [2], [3], [4]. However, evaluation results can be contradictory, as in [1] and [2]. Similarly to visual languages, the comprehensibility or understandability of software requirements specifications is the most common aspect evaluated in empirical requirements engineering studies [5]. At the same time, their practical value is often questioned [6]. A recent family of experiments by Abrahão et al. reports that providing sequence diagrams together with a natural language specification increases comprehensibility [7]. Additionally, the authors state that possible future work in this area could be “experiments to analyse the effect of different behavioural diagrams in the comprehension of software models”.

As a first step in this direction, we conducted a controlled experiment in order to understand which behavioural diagrams perform better than others for the specific case of modelling functional requirements with respect to comprehensibility. Specifically, we compared two modelling languages, Modal Sequence Diagrams (MSDs) [8] and Timed Automata (TA) [9]. We chose these two languages as they are representatives for sequence-based notations (MSDs) and state-based notations (TAs), as they are similar in terms of expressiveness and hence possible alternatives for expressing the same requirements, and as both have been applied in industrial case studies, e.g. [10], [11], [12]. Additionally, both languages were already used in a joint project with industrial partners. The experiment is based on an extensive and detailed requirements specification by a vehicle manufacturer that defines the behaviour of a software function to be realised by a supplier. Hence, the requirements specification is quite detailed and as such a reasonable candidate for modelling. We used 22 undergraduate students in a course on software modelling as subjects. Our results show no significant difference with respect to the comprehensibility of requirements modelled in the two languages, but requirements modelled in MSDs are significantly quicker to understand. This indicates that the current practice, selecting visual languages based on convenience, is in fact feasible with respect to the comprehension of the resulting requirements specifications. However, it might take longer to understand the requirements depending on the chosen language.

The remainder of this paper is structured as follows. In Section II, related literature is discussed. Section III covers the basics of the two visual modelling languages we used in the experiment. Section IV describes the experiment design, followed by a discussion of validity threats in Section V. Section VI presents the actual results and discusses them in depth. The paper is concluded in Section VII.

II. RELATED WORK

In the context of the UML [13], a number of experimental studies have been published that compare different modelling languages with respect to comprehensibility. The comprehensibility of UML behavioural diagrams, namely sequence, collaboration, and state machine diagrams, in both real-time and management information systems is compared using a controlled experiment by Otero and Dolado [1]. The results show that sequence diagrams are more comprehensible for real-time systems than for management information systems. With respect to the answering speed, their data shows that sequence diagrams perform better than collaboration and state machine diagrams for both domains. As subjects, 31 undergraduate students are used in their study.

In contrast to this, Glezer et al. report that sequence diagrams are more comprehensible for management information systems than for real-time systems [2]. The authors mainly attribute this difference to the previous knowledge of the subjects, who were not experienced in real-time systems. In this study, the 76 student subjects performed the experiment in terms of a mandatory mid-term exam.

Nugroho investigates the impact of detail on the comprehension of UML Class, Sequence, Package, and Use Case diagrams in form of a controlled experiment with 53 graduate students [3]. The author reports that a low level of detail can lead to misinterpretations and that the subjects' knowledge did not have an impact on the comprehension.

Staron et al. report the results from four controlled experiments studying the impact of using UML stereotypes on comprehensibility conducted with 68 students and 4 professionals in total [4]. The studies show that stereotypes indeed improve the comprehensibility and the total and relative times for answering the used questionnaires.

Similar to the UML, comprehensibility is a commonly studied characteristic of requirement specifications. Condori-Fernández et al. present an evaluation of empirical studies until 2008 on requirements comprehensibility [6]. The authors conclude that while comprehensibility studies are common, many of them have practical limitations, such as using made-up examples instead of real specifications.

Kamsties et al. study how different specification techniques affect the comprehensibility of a software requirements specification, using a re-engineered specification of a bicycle computer [14]. The authors report that black-box specification techniques, describing a system by its externally visible behaviour, lead to a faster and more correct answering of the used instrument than white-box specification techniques, where the system is described by the behaviour between its entities.

Finally, there are a number of studies which investigate the comprehensibility of requirements modelled in or enhanced with visual modelling languages. Scanniello et al. study the effect on requirements comprehensibility when using SysML diagrams in addition to natural language, compared to only natural language requirements [15]. The authors use students as subjects in two controlled experiments and report that comprehensibility is increased when SysML diagrams are provided, whereas completion time for the comprehension task is unaffected.

A recent paper by Abrahão et al. reports a family of five experiments on the comprehensibility of functional requirements modelled with sequence diagrams in addition to the natural language specification [7]. Hereby, one experiment uses undergraduate students, two experiments use master students, one experiment uses doctoral students and one experiment uses professionals as subjects. Four out of five experiments show statistically significant support for improved comprehensibility when using sequence diagrams.

In summary, a number of experiments exist that investigate the comprehensibility of visual modelling languages, of requirements specifications, and of requirements represented in visual modelling languages. However, the outcomes vary and are sometimes even contradictory, e.g. in [1] and [2]. Additionally, we are not aware of any experiment comparing requirements represented by behavioural models only. This is a gap in knowledge, as requirements are typically on a more abstract level than for example software design and are, additionally, often intended to be read and understood by non-experts. Particularly, in the automotive domain, it is the usual process that detailed requirements specifications covering the behaviour of software components are defined by the vehicle manufacturer and subsequently sent to the supplier who needs to correctly understand and realise the specified behaviour. We are filling this gap with our contribution in this paper.

III. BACKGROUND

In the following, we introduce the two compared modelling languages and illustrate them using sample models from our experiment. The models specify the behaviour for the case that a user wants to increase the speed of a wiper by one unit. Since both languages basically employ the same modelling notations to specify real-time aspects, we did not use any real-time aspects but instead focused on non real-time behaviour. Furthermore, a pilot of the experiment showed that the real-time aspects were too difficult to understand for the planned experiment. We plan a future experiment specifically targeting the real-time aspects.

A. Modal Sequence Diagrams

Modal Sequence Diagrams (MSDs) [8] are a recent variant of Live Sequence Charts (LSCs) [16] to model the behaviour of a set of objects. MSDs/LSCs are sequence diagrams that, by different modalities assigned to messages and conditions, allow the precise description of scenarios with liveness (something good must happen) and safety (something bad must not happen) properties. Notably, LSCs and MSDs define how multiple scenarios can be active concurrently and synchronise on common events as well as activate and de-activate MSDs. This allows engineers to flexibly specify systems that fulfill different tasks at the same time.

One key advantage of MSDs/LSCs is that they can be executed with the play-out algorithm, which allows engineers and other stakeholders to understand the behaviour emerging from the interplay of the scenarios [17]. Furthermore, it is possible to analyse whether a set of scenarios can be realised, i.e., whether it is free of contradictions and deadlocks.

Figure 1 shows a sample MSD used in our experiment. It specifies the communication between a user, a wiper controller as well as the actual wiper actuator. The sequence in the figure describes that (1) a request is sent to the wiper controller to increase the speed, (2) it is checked whether the wiper is in the state active, and (3) the controller sends a message to the actuator to increase the speed by one. If the check in step 2 fails, the MSD will be de-activated and not further executed. Once the first message in an MSD is executed (wiperRequest(WiperRequest::WIPER_INCREASE) in Figure 1), it is called active. After the last message, the MSD is deactivated again. The numbers on the right side describe the so-called cut, the positions in which an MSD can be.

The complete MSD model consists of a set of five scenarios covering the communication and conditions for the three mentioned objects.

[Fig. 1. Sample MSD "Start_Increase" with the lifelines usr: User, wiper: WiperController, and act: WiperActuator, the messages wiperRequest(WiperRequest::WIPER_INCREASE) and addToCurrentSpeed(1), the condition wiper.wiperState == WiperState::WIPER_ACTIVE, and the cut positions 0 and 1]

B. Timed Automata

Timed automata [18] are a state-based formalism which extends finite automata with a set of real-valued variables called clocks as well as various real-time constraints. Several timed automata can be combined into a network of timed automata where different automata synchronise their behaviour by so-called synchronisation channels. Synchronisation channels can be used as a means to specify synchronous message passing. Timed automata can be both simulated and verified for correctness using model checking.

Figure 2 shows the timed automaton for the wiper controller covering the increase wiper speed scenario as described previously for the MSD. It defines that if a wiper request for increasing the speed (condition: wiperRequestSignal==WiperRequest_INCREASE) using the synchronisation channel WiperRequest? is received and the wiper is active (condition: Actuator_WiperState==WiperState_ACTIVE), then the wiper speed is increased by 1 and a helper variable is set to 1 (addToCurrentSpeedSignal:=1). This helper variable is used in another automaton for a long-press functionality.

The complete TA model consists of a network of five timed automata covering the communication and conditions for the three mentioned objects.

[Fig. 2. Sample TA with the locations s1, s2, and error and transitions synchronising on the channels WiperRequest? and AddToCurrentSpeed?]

IV. EXPERIMENT DESIGN

The evaluation of comprehensibility of the two considered modelling languages used for requirements engineering was performed using a controlled experiment. The goal of this experiment is formulated as follows, using the Goal/Question/Metric paradigm [19]:

• Analyse requirements modelled in two different modelling languages for the purpose of comparison with respect to comprehensibility from the point of view of software developers in the following context: application (verification and validation), subjects (students).

We used a between-subject randomised design with two treatments [20]. The between-subject design was chosen to avoid learning effects. The treatments are the used modelling language, namely MSD and TA. MSDs, a variant of Live Sequence Charts [21], are sequence diagrams with assigned modalities that allow the expression of liveness and safety properties, and real-time constraints. Timed Automata are a modification of Finite Automata for the specification and verification of real-time systems. Hence, both languages are very similar in terms of expressiveness. However, MSDs use a scenario-based description covering multiple objects in one MSD, whereas TA use a state-based description covering a single object in one TA. MSDs were chosen in order to have a sequence-based language with executable semantics, in contrast to UML Sequence Diagrams, and with the possibility to model required and forbidden behaviour. In order to not introduce any bias, we chose TA as a second modelling language, as this language had not been introduced during the course either. Both languages are used without their timing functionality. Subjects were randomly assigned one of the two treatments. In the following subsections, the details of the experiment design are presented.

A. Subjects

We performed the experiment with 22 students from an undergraduate course on software modelling. This is due to availability reasons, as we had a scheduled university course at the end of 2014 in which we could perform the experiment. The students had basic knowledge of UML, as the experiment was performed towards the end of the course. Both modelling languages were only introduced prior to the experiment in a single 45-minute lecture. However, the students were introduced to similar languages earlier in the course, namely to UML sequence diagrams and to UML state machine diagrams.

B. Instrumentation

As a basis for the experimental objects, which we used in the study, we selected requirements from a real-life project within the automotive domain from an industrial partner. The selected requirements describe joint behaviour, i.e. the requirements are not entirely independent. As these requirements are confidential, we abstracted them and changed their actual content to resemble a car wiper specification. However, we ensured that the complexity and the logic are comparable. These requirements were then modelled by the main author of this paper using MSD and TA. The resulting experimental objects consist of two requirements models, S_MSD and S_TA, consisting of five diagrams each. The diagrams specify the activation of a car wiper in slow mode and in fast mode, the increase of the wiper's speed in two different ways, and the deactivation of the wiper. Additionally, the experimental objects contained a
single page describing the context of each treatment. For the MSD specification, this consisted of a UML class diagram and a UML object diagram, and for the TA specification, this page contained the system declarations.

Finally, the instrument contained one page of syntax and semantic explanation in addition to the introduction lecture and a questionnaire. In turn, the questionnaire consisted of a pre-experiment part, collecting demographic data about the subjects (including subjects' knowledge regarding modelling languages), a post-experiment part, collecting subjective judgment, and the actual measurement questionnaire consisting of 12 questions targeting the subjects' understanding. The pre- and post-experiment questionnaires were used to judge whether previous experience, understanding of the introduction lectures, or other factors might have affected the dependent variables. Due to space limitations, we only discuss the data obtained from these questionnaires briefly in Section VI. Each of the 12 questions consisted of an initial state of the system and a number of executed messages or commands. Then, 2 sub-questions were asked. The subjects first had to answer whether the execution violated the requirements or not. Additionally, we asked in which state the system was after the execution (or right before the requirements violation), either by asking for the system's variable values or by asking for the active cuts/states of each diagram. Both sub-questions were awarded one point each. The second sub-question was only counted if the first sub-question was correct, as it was otherwise already clear that the subject had wrongly executed the requirements. An example question with solutions for both the MSD and the TA model is depicted in Figure 3.

Question for the TA model:

Precondition:
    Actuator_WiperSpeed = Constants_SLOW
    Actuator_WiperState = WiperState_ACTIVE
    Wiper_VehicleStatus = VehicleStatus_RUNNING
    Wiper_WiperConfiguration = WiperConfig_INSTALLED
1. WiperRequest is triggered, with wiperRequestSignal set to WiperRequest_OFF
2. SetWiperSpeed is triggered, with setWiperSpeedSignal set to Constants_OFF
Question 1: Does the input scenario violate the specified behaviour?
Answer 1: No , Yes, in step: 1 ☐, 2 ☐
Question 2: Which values do the following variables have
    - after the execution of the input scenario (if A1 is ‘No’)
    - before the violating step (if A1 is ‘Yes’)?
Answer 2:
    Actuator_WiperSpeed = Constants_OFF
    Actuator_WiperState = WiperState_ACTIVE
    Wiper_VehicleStatus = VehicleStatus_RUNNING
    Wiper_WiperConfiguration = WiperConfig_INSTALLED

Question for the MSD model:

Precondition:
    act.wiperSpeed = Constants.SPEED_SLOW
    act.wiperState = WiperState::WIPER_ACTIVE
    wiper.vehicleStatus = VehicleStatus::RUNNING
    wiper.configuration = WiperConfig::WIPER_INSTALLED
1. usr sends Message ‘wiperRequest(WiperRequest::WIPER_OFF)’ to wiper
2. wiper sends Message ‘setWiperSpeed(Constants.SPEED_OFF)’ to act
Question 1: Does the input scenario violate the specified behaviour?
Answer 1: No , Yes, in step: 1 ☐, 2 ☐
Question 2: Which values do the following variables have
    - after the execution of the input scenario (if A1 is ‘No’)
    - before the violating step (if A1 is ‘Yes’)?
Answer 2:
    act.wiperSpeed = Constants.SPEED_OFF
    act.wiperState = WiperState::WIPER_ACTIVE
    wiper.vehicleStatus = VehicleStatus::RUNNING
    wiper.configuration = WiperConfig::WIPER_INSTALLED

Fig. 3. Example Question for TA Model (above) and MSD Model (below)

This questionnaire approach has been successfully applied in many similar studies, e.g. in [7], [15], [1]. The instrument, together with the resulting raw data, is published at http://www.grischaliebel.de/data/research/instrument_exp_msd_ta.zip.

C. Variables

There is only a single independent variable in the performed experiment. This is the used visual modelling language with the values MSD or TA. We measured the comprehensibility of the used requirements specification using three dependent variables:

Answered: The number of answered questions.
AScore: The average score achieved per answered question.
Score: The total score achieved for all 12 questions.

Instead of measuring the time, we decided to design the instrument in a way that it would be difficult to answer all questions in the given time frame. Therefore, we use the number of answered questions, Answered, instead of the needed time. We are foremost interested in using modelling languages for verification and validation purposes later on. Therefore, we think that an accurate understanding of a specification is more important than speed. This is why we chose AScore as a metric for measuring how correctly a question is answered on average. For completeness, we also added Score, which is related to the other two metrics by Score = Answered ∗ AScore.

We opted for comprehensibility instead of letting subjects create diagrams themselves, as this is easier and requires less training. Furthermore, the experiment targets models of functional requirements, not simply behavioural models in general. Therefore, we argue that comprehensibility is of particular importance, as the aim of requirements is to document what a system shall fulfill. Hence, correctly understanding these requirements is crucial.

An additional variable which can influence the outcome of the experiment is the subjects' knowledge regarding modelling languages and their domain knowledge in the automotive domain. While all students are from the same course, they might have different previous knowledge and experience. To address this issue we employed a pre-experiment survey which asked for background information, such as previous courses on modelling taken by the subject.

D. Hypotheses

In the course of the experiment, we used the following null and alternative hypotheses, H0 and H1, which we formulated as follows.

• H0: There is no significant difference between Modal Sequence Diagrams and Timed Automata with respect to comprehensibility of requirements specifications.
• H1: There are significant differences between Modal Sequence Diagrams and Timed Automata with respect to comprehensibility of requirements specifications.

We evaluated the hypotheses separately for each of the dependent variables. Each of the variables was tested for significance using a non-parametric Mann-Whitney U test. Additionally, we tested for equality of variances for each of the variables using a Levene test in order to fulfill the assumptions of the Mann-Whitney U test. For both tests, we used a significance level of 0.05.

E. Operation

The experiment was piloted with two PhD students prior to execution. The instrument turned out to be too complicated and was therefore simplified further to its current form.

The experiment was conducted in a 90-minute lecture. Participation was voluntary and the students received no benefits for the modelling course, such as bonus points or higher grades. In the first 45 minutes, both visual modelling languages were introduced. While this is a rather short time for introducing two new languages, we were limited to this time frame by the course schedule. Additionally, the subjects had previous knowledge in similar languages from the course, so that it was possible to relate the newly introduced languages to that knowledge. Prior to the introduction lecture, we already handed out the experimental objects, so that the subjects knew which treatment they would receive and could concentrate on that language during the lecture. Additionally, they could familiarise themselves with the model. The subjects were encouraged not to share or exchange the objects with each other. After the introduction lecture, we handed out the remaining parts of the instrument, namely the questionnaires and the syntax help. Subjects then received 3 minutes for filling out the pre-experiment questionnaire, 40 minutes to fill out the experiment questionnaire, and finally 2 minutes for the post-experiment questionnaire.

V. VALIDITY

In the following, we discuss the measures we took in order to ensure validity. We use the four aspects of validity as presented in Wohlin et al. [20].

A. Construct Validity

In order to avoid inadequate preoperational explication of constructs, we have explicitly defined what 'comprehensibility' means with respect to our study. Also, it is clearly defined that a higher score in any of the three dependent variables means a better result for that variable. Our dependent variables do not require any human judgment and are therefore objective. Mono-operation bias can currently not entirely be ruled out, as we only used one experimental object. We are planning to replicate the experiment with another requirements specification in the future in order to address this. Mono-method bias is addressed by asking two sub-questions for each of the 12 experiment questions. While the first of the two sub-questions is a simple yes/no question, it is an additional check whether the subject has correctly understood the model. If this one is already incorrect, we automatically awarded 0 points to the second sub-question as well. Additionally, the second sub-question was much harder to get right by chance.

B. Internal Validity

In order to avoid maturation or learning effects, subjects were only allowed to participate in the experiment once and only in one group, and were not allowed to exchange information with other subjects during the experiment. Additionally, we used a pre-experiment questionnaire in order to assess the subjects' domain and modelling knowledge, which might affect the outcome. While all students came from the same course, they had different previous experience with respect to software modelling and requirements engineering. We also ensured that the subjects voluntarily participated in the experiment, by not giving rewards in the form of improved course grades or similar, in order to avoid compensation rivalry or demoralisation. However, we cannot entirely rule out that some subjects participated to win our appraisal later in the course. The fact that we used volunteers might bias the results, as they could have been more motivated than the average.

C. External Validity

We used parts of a real-life specification instead of a toy example for the experiment instrument. However, the requirements had to be abstracted as the original specification is confidential. Additionally, while modelling the requirements, we had to ensure that both treatments were modelled in the same way and exhibited the same behaviour. This could have led to one of the treatments being modelled in a way which would not happen in practice, and thus limit generalisability. We tried to reduce this threat by iteratively discussing and improving the instrument among the authors of this paper. Additionally, the fact that we used student subjects possibly limits the generalisability to an industrial context. Finally, the specification is based on an automotive requirements specification, which can limit the generalisability to other domains.

D. Conclusion Validity

We tried to avoid ambiguous wording of questions in the questionnaire by iteratively reviewing and improving it. Additionally, we performed a pilot experiment with two PhD students prior to the actual experiment, in order to improve both the introduction material and the questionnaire. Reliability of treatment implementation is given, as the introduction lecture was only given once for the actual experiment. We only performed statistical tests on the three dependent variables, which were defined up-front, and did not fish for results [20].

VI. RESULTS AND DISCUSSION

In the following, we will discuss first the demography of the subjects participating in the experiment. Afterwards, we present and discuss the results of the hypothesis testing for the experiment. Finally, we finish with a discussion of the post-experiment questionnaire.

A. Demographic Data

Out of the 22 subjects, 19 are Bachelor students and 3 are Master students. This can be explained through the fact that the course in which we performed the experiment is on Bachelor level, but can be taken as an elective course by first year Master students. All 3 Master students were randomly assigned the MSD treatment. Out of 22 subjects, 13 have a secondary school degree, 7 a Bachelor degree, 1 a Master degree, and 1 subject another degree as their highest degree. This means that 5 subjects on Bachelor level are already in possession of a Bachelor degree, and one Master student already has a Master degree. While this is certainly possible, it might also be caused by misunderstanding the question. Most subjects already had previous courses on related topics, such as Object-oriented programming or Software Architecture. Only 6 subjects stated that they had not taken any related courses previously. Additionally, we asked the subjects for their professional experience in developing software, in modelling software, and in requirements engineering. In both modelling software and in requirements engineering, only 3 subjects answered that they had previous professional experience, ranging from half a year to three years of experience. In addition to this, 9 subjects stated that they have professional experience in software development, with one subject each stating 0.3 years, 1 year, and 8 years of experience, and 3 subjects each stating 2 and 3 years of experience.

B. Experiment Results

The experiment was conducted on 4th December 2014 at Chalmers University in Gothenburg, Sweden. The answers from the paper questionnaire were afterwards digitalised in order to allow computerised data processing. An overview over both the descriptive statistics and the significance testing for all three variables is depicted in Tables I and II.

TABLE I
DESCRIPTIVE STATISTICS OF THE EXPERIMENT

Treatment   Dependent variable   Mean    Standard deviation
TA          Answered             5.667   3.42
TA          AScore               0.693   0.726
TA          Score                5.583   7.669
MSD         Answered             9.4     3.273
MSD         AScore               0.576   0.45
MSD         Score                6.2     5.453

TABLE II
SIGNIFICANCE TESTING OF THE EXPERIMENT

Dependent variable   Significance level (Levene)   Significance level (Mann-Whitney U)   H0 rejected
Answered             p ≈ 0.94                      p ≈ 0.021                             Yes
AScore               p ≈ 0.097                     p ≈ 0.947                             No
Score                p ≈ 0.707                     p ≈ 0.464                             No

The results of the first dependent variable, Answered, are depicted in Figure 4 for both treatments. Clearly, subjects in the TA group took longer to answer the questionnaire, which led to only one subject finishing all questions. In the MSD group, half of the subjects finished all questions. Additionally, four subjects in the TA group answered three or fewer questions, whereas this is only the case for one subject in the MSD group. The large difference in the two means for this variable already indicates that the null hypothesis can be rejected, which is confirmed by the significance test with p ≈ 0.021. Hence, there is a significant difference with respect to the number of answered questions between the two treatments. A possible explanation for this might be the nature of MSDs, compared to TAs. While a single MSD has to be taken into account only once it is activated, each automaton in a TA is 'active' by definition. This means that for each message in a given scenario, all automata need to be studied, while only a subset of the MSDs needs to be considered.

[Fig. 4. Answered of TA and MSD treatment: number of answered questions per subject, panels Answered (TA), subjects 1 to 12, and Answered (MSD), subjects 1 to 10]

The second dependent variable, AScore, is depicted in Figure 5 for both TA and MSD treatment. Here, in the TA treatment there is a much larger variance in the data set, with both very high and very low values. For the MSD treatment, there are few values in the extremes. The statistical test results in p ≈ 0.947, so that the null hypothesis cannot be rejected for this variable. We do not have an explanation for the large differences between subjects in the TA treatment, but they might be attributed to misunderstandings with respect to the modelling language. Several subjects achieved average scores under 1 point, even though they stated in the post-experiment questionnaire that they were confident in their answers. We plan to replicate the experiment in the future, including some simple upfront questions in order to measure whether the subjects have really understood the languages well enough, and to analyse whether this correlates with the self-assessment.

[Fig. 5. AScore of TA and MSD treatments: average score per subject, panels AScore (TA), μ = 0.693, σ² = 0.526, and AScore (MSD), μ = 0.576, σ² = 0.202]

[Fig. 6. Score of TA and MSD treatment: total score per subject, panels Score (TA), μ = 5.583, σ² = 58.81, and Score (MSD), μ = 6.2, σ² = 29.73]

TABLE III
CORRELATIONS BETWEEN DEPENDENT VARIABLES AND DEMOGRAPHICS

Dependent variable   Previous Courses    Bachelor/Master     Confidence
                     r        p          r         p         r        p
Answered             0.207    0.355      −0.247    0.267     0.56     0.007
AScore               0.122    0.588      0.116     0.606     0.513    0.015
Score                0.17     0.45       0.074     0.745     0.557    0.007

As the third variable Score is directly computed from AScore and Answered, it exhibits a similar pattern (see Figure 6). In the TA group, two subjects achieved 20 or more points, close to the maximum of 24. However, many subjects in this group have low total scores. As subjects in the MSD group have significantly higher values in the Answered metric, their average Score values are higher, even though the average score AScore is lower for this group.
average, a clear grasp of whether they understood the instrument In summary, we can state that MSDs are significantly quicker or not. Interestingly, both the number of previous courses to comprehend. Therefore, if speed is a relevant factor, MSDs and the education level only show a small correlation with should be chosen instead of TA. One could argue that speed the dependent variables. Similarly, previous experience in itself is not relevant, as long as AScore is low. Therefore, we Software Development, Software Modelling, and Requirements plan to replicate the experiment with subjects who are more Engineering has small correlation with the dependent variables, familiar with the modelling languages, in order to see whether as depicted in Table IV. These results could indicate that the difference in speed is still present. the dependent variables were in fact influenced by other factors, such as confusion regarding the newly introduced C. Correlation between Demographic Data and Dependent modelling languages. However, they could also indicate that Variables the understanding of the requirements is not dependent on We used the Pearson product-moment correlation coefficient previous education and experience. Further replications will be to assess the correlations between the three dependent variables necessary in order to answer these questions in a satisfactory and the number of related courses previously taken by manner. students, the education level (Bachelor/Master), and the subject’s confidence in their answers. The resulting values TABLE IV for Pearson’s r and the p-value are depicted in Table III. C ORRELATIONS BETWEEN D EPENDENT VARIABLES AND E XPERIENCE Assuming an effect size of r < 0.3 as small, an effect size of 0.3 ≤ r < 0.4 as medium, and an effect size of r ≥ 0.4 Dependend Software Dev. Modelling Req. Eng. 
Variable r p r p r p as large, we see that there is a large correlation between Answered 0.08 0.724 0.054 0.811 0.092 0.685 all three dependent variables and the subject’s confidence AScore −0.044 0.846 0.053 0.816 0.063 0.781 in their results. This result indicates that subjects had, in Score −0.09 0.692 0.028 0.901 0.039 0.863 23 VII. C ONCLUSIONS AND F UTURE W ORK R EFERENCES [1] M. C. Otero and J. J. Dolado, “Evaluation of the comprehension of In this paper, we have presented the results of a controlled the dynamic modeling in UML,” Information and Software Technology, experiment with 22 students in an undergraduate course on vol. 46, no. 1, pp. 35–53, 2004. software modelling. We studied the comprehensibility of [2] C. Glezer, M. Last, E. Nachmany, and P. Shoval, “Quality and com- prehension of uml interaction diagrams-an experimental comparison,” functional requirements modelled in two graphical languages, Information and Software Technology, vol. 47, no. 10, pp. 675–692, Modal Sequence Diagrams, a sequence-based notation, and 2005. Timed Automata, a state-based notation. Subjects received a [3] A. Nugroho, “Level of detail in uml models and its impact on model comprehension: A controlled experiment,” Information and Software model in one of the two languages and a questionnaire with Technology, vol. 51, no. 12, pp. 1670–1685, 2009. questions testing their understanding of the model. While we [4] M. Staron, L. Kuzniarz, and C. Wohlin, “Empirical assessment of can not reject the null hypothesis, that there are no significant using stereotypes to improve comprehension of uml models: A set of experiments,” Journal of Systems and Software, vol. 79, no. 5, pp. differences between the two treatments, for both the average 727–742, 2006. and the total questionnaire scores, subjects receiving the Modal [5] N. Condori-Fernández, M. Daneva, K. Sikkel, R. Wieringa, O. Dieste, Sequence Diagram specification answered significantly more and O. 
Pastor, “A systematic mapping study on empirical evaluation of software requirements specifications techniques,” in Proceedings of the questions. This indicates that if the speed or the efficiency plays 2009 3rd International Symposium on Empirical Software Engineering an important role, scenario-based models should be considered and Measurement. IEEE Computer Society, 2009, pp. 502–505. instead of the state-based models. However, further studies [6] N. Condori-Fernández, M. Daneva, K. Sikkel, and A. Herrmann, “Practical relevance of experiments in comprehensibility of requirements need to be conducted in order to understand whether this specifications,” in Empirical Requirements Engineering (EmpiRE), 2011 effect persists with more experienced users who achieve higher First International Workshop on, Aug 2011, pp. 21–28. overall scores. [7] S. Abrahão, C. Gravino, E. Insfran, G. Scanniello, and G. Tortora, “Assessing the effectiveness of sequence diagrams in the comprehension While our sample of students without a previous knowledge of functional requirements: Results from a family of five experiments,” of the used treatments can be seen as a possible threat to Software Engineering, IEEE Transactions on, vol. 39, no. 3, pp. 327–342, March 2013. validity, this lack of experience is in fact a realistic setup [8] D. Harel and S. Maoz, “Assert and negate revisited: Modal semantics for industrial use in the automotive domain. As requirements for UML sequence diagrams,” Software and Systems Modeling (SoSyM), specifications are used across organisations and across roles vol. 7, no. 2, pp. 237–252, May 2008. [9] R. Alur and D. L. Dill, “A Theory of Timed Automata,” Theoretical within an organisation, it can not be assumed that the receiver Computer Science, vol. 126, no. 2, pp. 183–235, 1994. of a specification is always familiar with every detail of the used [10] K. G. Larsen, M. Mikucionis, B. Nielsen, and A. Skou, “Testing real- language. 
Additionally, receivers are often no experts in mod- time embedded software using uppaal-tron: An industrial case study,” in Proceedings of the 5th ACM International Conference on Embedded elling, but in other areas such as requirements engineering or Software. ACM, 2005. system design. Therefore, in contrast to, for example, software [11] A. Fehnker, “Scheduling a steel plant with timed automata,” in rtcsa. development, the receivers of a requirements specification can IEEE, 1999, p. 280. [12] J. Greenyer, M. Haase, J. Marhenke, and R. Bellmer, “Evaluating a not be expected to be experts in the used language. Additionally, formal scenario-based method for the requirements analysis in automotive our results indicate that the current practice, choosing the software engineering,” in Proceedings of the 2015 10th Joint Meeting modelling language based on convenience, is not a threat to on Foundations of Software Engineering. ACM, 2015. [13] O. M. Group, “Unified modeling language,” http://www.uml.org/, Jun. the comprehension of the specifications in itself. 2014. In the future, we will replicate the experiment both with [14] E. Kamsties, A. von Knethen, and R. Reussner, “A controlled experiment different groups of students and with professionals from our to evaluate how styles affect the understandability of requirements specifications,” Information and Software Technology, vol. 45, no. 14, industrial partners in order to eliminate possible bias and to pp. 955–965, 2003, eighth International Workshop on Requirements assess whether experience and a deeper knowledge of the Engineering: Foundation for Software Quality. languages can have a significant impact on the understanding. [15] G. Scanniello, M. Staron, H. Burden, and R. 
Heldal, “On the effect of using SysML requirement diagrams to comprehend requirements: Results Additionally, we will aim at generating a theory on which from two controlled experiments,” in 18th International Conference on languages are suitable for which kind of task or system when Evaluation Assessment in Software Engineering (EASE), May 2014, pp. modelling requirements. 433–442. [16] W. Damm and D. Harel, “LSCs: Breathing life into message sequence charts,” in Formal Methods in System Design, vol. 19. Kluwer Academic, ACKNOWLEDGEMENT 2001, pp. 45–80. [17] D. Harel and R. Marelly, Come, Let’s Play: Scenario-Based Programming Using LSCs and the Play-Engine. Springer, August 2003. We would like to express our gratitude to Nadja Marko [18] J. Bengtsson and W. Yi, “Timed automata: Semantics, algorithms and and Christian Webel, who helped in reviewing and discussing tools,” in Lectures on Concurrency and Petri Nets, vol. 3098. Springer, an early experiment design. Additionally, we would like to 2003, pp. 87–124. [19] V. R. Basili, “Software modeling and measurement: The thank Pariya Kashfi and Vard Antinyan for participating in goal/question/metric paradigm,” Tech. Rep., 1992. the pilot experiment. The research leading to these results has [20] C. Wohlin, P. Runeson, M. Höst, M. C. Ohlsson, and B. Regnell, received partial funding from the European Union’s Seventh Experimentation in Software Engineering. Springer, 2012. [21] W. Damm and D. Harel, “LSCs: Breathing life into message sequence Framework Program (FP7/2007-2013) for CRYSTAL-Critical charts,” in Formal Methods in System Design, vol. 19. Kluwer Academic, System Engineering Acceleration Joint Undertaking under grant 2001, pp. 45–80. agreement No 332830 and from Vinnova under DIARIENR 2012-04304. 24
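As a cross-check of the Score figures reported in Tables I and II, the following Python sketch recomputes the mean, the standard deviation, and a Mann-Whitney U test from the twelve TA and ten MSD bar labels that accompany Fig. 6. It is not the authors' analysis script. The paper does not state which tool or test variant was used, so the tie-corrected normal approximation with continuity correction is an assumption, and the names `SCORE_TA`, `SCORE_MSD`, `describe`, and `mann_whitney_u` are purely illustrative.

```python
import math
import statistics

# Bar labels read off Fig. 6 (Score per subject). The subject order is not
# recoverable from the figure residue, which does not affect these statistics.
SCORE_TA = [21, 20, 9, 9, 3, 2, 2, 1, 0, 0, 0, 0]
SCORE_MSD = [16, 14, 9, 7, 6, 5, 2, 2, 1, 0]


def describe(values):
    """Return (mean, sample standard deviation) of a list of numbers."""
    return statistics.mean(values), statistics.stdev(values)


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test using the normal approximation.

    Assumptions of this sketch: midranks for ties, a tie-corrected
    variance, and a continuity correction of 0.5. Both samples must
    contain at least two distinct values overall.
    Returns (U statistic of the first sample, two-sided p-value).
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted(list(x) + list(y))

    # Give every distinct value the average of the ranks it occupies and
    # accumulate the tie term sum(t^3 - t) over all tie groups.
    midrank = {}
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        t = j - i                            # size of this tie group
        midrank[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        tie_term += t**3 - t
        i = j

    rank_sum_x = sum(midrank[v] for v in x)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2

    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))

    # Continuity-corrected z score, clamped at zero.
    z = max(abs(u_x - mean_u) - 0.5, 0.0) / math.sqrt(var_u)
    p_two_sided = math.erfc(z / math.sqrt(2))
    return u_x, p_two_sided


if __name__ == "__main__":
    for name, data in (("TA", SCORE_TA), ("MSD", SCORE_MSD)):
        mean, sd = describe(data)
        print(f"{name}: mean={mean:.3f} sd={sd:.3f}")
    u, p = mann_whitney_u(SCORE_TA, SCORE_MSD)
    print(f"U(TA)={u} p={p:.3f}")
```

Under these assumptions the sketch reproduces the Table I values for Score (mean 5.583 and standard deviation 7.669 for TA, 6.2 and 5.453 for MSD) and yields a p-value of roughly 0.46, in line with the p ≈ 0.464 reported in Table II. A different tie or continuity treatment would shift the p-value slightly without changing the conclusion that H0 is not rejected for Score.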