=Paper= {{Paper |id=Vol-1522/Liebel2015HuFaMo |storemode=property |title=Comparing Comprehensibility of Modelling Languages for Specifying Behavioural Requirements |pdfUrl=https://ceur-ws.org/Vol-1522/Liebel2015HuFaMo.pdf |volume=Vol-1522 |dblpUrl=https://dblp.org/rec/conf/models/LiebelT15 }} ==Comparing Comprehensibility of Modelling Languages for Specifying Behavioural Requirements== https://ceur-ws.org/Vol-1522/Liebel2015HuFaMo.pdf

Comparing Comprehensibility of Modelling
Languages for Specifying Behavioural Requirements
Grischa Liebel, Software Engineering Division, Chalmers | University of Gothenburg, Sweden, grischa@chalmers.se
Matthias Tichy, Ulm University, Germany, matthias.tichy@uni-ulm.de

Abstract: The selection of a suitable modelling language influences the success of software modelling. Several experiments comparing the comprehensibility of graphical modelling languages have been published. However, no published study comparing the comprehensibility of functional requirements modelled in different graphical modelling languages exists. This paper evaluates how two requirements modelled in a sequence-based notation, Modal Sequence Diagrams, and in a state-based notation, Timed Automata, compare with respect to comprehensibility. A controlled experiment with 22 students from an undergraduate course on software modelling was performed. Our results show no significant differences with respect to the comprehensibility of the two different languages, but subjects who answered the questionnaire for the sequence-based notation completed significantly more answers in the given time limit. These initial results indicate that choosing a modelling language for requirements modelling based on convenience does not significantly affect the understanding of the resulting requirements.
I. INTRODUCTION
Choosing a visual modelling language in practice is typically dependent on previous experience with the modelling language or the availability of the respective modelling tools. While both aspects are certainly reasonable, other criteria are similarly important.
Comprehensibility is a commonly evaluated criterion of visual languages, e.g., for UML diagrams in [1], [2], [3], [4]. However, evaluation results can be contradictory, as in [1] and [2]. Similarly to visual languages, the comprehensibility or understandability of software requirements specifications is the most common aspect evaluated in empirical requirements engineering studies [5]. At the same time, their practical value is often questioned [6]. A recent family of experiments by Abrahão et al. reports that providing sequence diagrams together with a natural language specification increases comprehensibility [7]. Additionally, the authors state that possible future work in this area could be “experiments to analyse the effect of different behavioural diagrams in the comprehension of software models”.
As a first step in this direction, we conducted a controlled experiment in order to understand which behavioural diagrams outperform others for the specific case of modelling functional requirements with respect to comprehensibility. Specifically, we compared two modelling languages, Modal Sequence Diagrams (MSDs) [8] and Timed Automata (TA) [9]. We chose these two languages as they are representatives for sequence-based notations (MSDs) and state-based notations (TAs), as they are similar in terms of expressiveness and hence possible alternatives for expressing the same requirements, and as both have been applied in industrial case studies, e.g. [10], [11], [12]. Additionally, both languages were already used in a joint project with industrial partners. The experiment is based on an extensive and detailed requirements specification by a vehicle manufacturer that defines the behaviour of a software function to be realised by a supplier. Hence, the requirements specification is quite detailed and as such a reasonable candidate for modelling. We used 22 undergraduate students in a course on software modelling as subjects. Our results show no significant difference with respect to the comprehensibility of requirements modelled in the two languages, but requirements modelled in MSDs are significantly quicker to understand. This indicates that the current practice, selecting visual languages based on convenience, is in fact feasible with respect to the comprehension of the resulting requirements specifications. However, it might take longer to understand the requirements depending on the chosen language.
The remainder of this paper is structured as follows. In Section II, related literature is discussed. Section III covers the basics of the two visual modelling languages we used in the experiment. Section IV describes the experiment design, followed by a discussion of validity threats in Section V. Section VI presents the actual results and discusses them in depth. The paper is concluded in Section VII.
II. RELATED WORK
In the context of the UML [13], a number of experimental studies have been published that compare different modelling languages with respect to comprehensibility. The comprehensibility of UML behavioural diagrams, namely sequence, collaboration, and state machine diagrams, in both real-time and management information systems is compared using a controlled experiment by Otero and Dolado [1]. The results show that sequence diagrams are more comprehensible for real-time systems than for management information systems. With respect to the answering speed, their data shows that sequence diagrams perform better than collaboration and state machine diagrams for both domains. As subjects, 31 undergraduate students are used in their study.

In contrast to this, Glezer et al. report that sequence diagrams are more comprehensible for management information systems than for real-time systems [2]. The authors mainly attribute this difference to the previous knowledge of the subjects, who were not experienced in real-time systems. In this study, the 76 student subjects performed the experiment in terms of a mandatory mid-term exam.
Nugroho investigates the impact of detail on the comprehension of UML Class, Sequence, Package, and Use Case diagrams in form of a controlled experiment with 53 graduate students [3]. The author reports that a low level of detail can lead to misinterpretations and that the subjects’ knowledge did not have an impact on the comprehension.
Staron et al. report the results from four controlled experiments studying the impact of using UML stereotypes on comprehensibility conducted with 68 students and 4 professionals in total [4]. The studies show that stereotypes indeed improve the comprehensibility and the total and relative times for answering the used questionnaires.
Similar to the UML, comprehensibility is a commonly studied characteristic of requirement specifications. Condori-Fernández et al. present an evaluation of empirical studies until 2008 on requirements comprehensibility [6]. The authors conclude that while comprehensibility studies are common, many of them have practical limitations, such as using made-up examples instead of real specifications.
Kamsties et al. study how different specification techniques affect the comprehensibility of a software requirements specification, using a re-engineered specification of a bicycle computer [14]. The authors report that black-box specification techniques, describing a system by its externally visible behaviour, lead to a faster and more correct answering of the used instrument than white-box specification techniques, where the system is described by the behaviour between its entities.
Finally, there are a number of studies which investigate the comprehensibility of requirements modelled in or enhanced with visual modelling languages. Scanniello et al. study the effect on requirements comprehensibility when using SysML diagrams in addition to natural language, compared to only natural language requirements [15]. The authors use students as subjects in two controlled experiments and report that comprehensibility is increased when SysML diagrams are provided, whereas completion time for the comprehension task is unaffected.
A recent paper by Abrahão et al. reports a family of five experiments on the comprehensibility of functional requirements modelled with sequence diagrams in addition to the natural language specification [7]. Hereby, one experiment uses undergraduate students, two experiments use master students, one experiment uses doctoral students and one experiment uses professionals as subjects. Four out of five experiments show statistically significant support for improved comprehensibility when using sequence diagrams.
In summary, a number of experiments exist that investigate the comprehensibility of visual modelling languages, of requirements specifications, and of requirements represented in visual modelling languages. However, the outcomes vary and are sometimes even contradictory, e.g. in [1] and [2]. Additionally, we are not aware of any experiment comparing requirements represented by behavioural models only. This is a gap in knowledge, as requirements are typically on a more abstract level than for example software design and are, additionally, often intended to be read and understood by non-experts. Particularly, in the automotive domain, it is the usual process that detailed requirements specifications covering the behaviour of software components are defined by the vehicle manufacturer and subsequently sent to the supplier who needs to correctly understand and realise the specified behaviour. We are filling this gap with our contribution in this paper.
III. BACKGROUND
In the following, we introduce the two compared modelling languages and illustrate them using sample models from our experiment. The models specify the behaviour for the case that a user wants to increase the speed of a wiper by one unit. Since both languages basically employ the same modelling notations to specify real-time aspects, we did not use any real-time aspects but instead focused on non real-time behaviour. Furthermore, a pilot of the experiment showed that the real-time aspects were too difficult to understand for the planned experiment. We plan a future experiment specifically targeting the real-time aspects.
A. Modal Sequence Diagrams
Modal Sequence Diagrams (MSDs) [8] are a recent variant of Live Sequence Charts (LSCs) [16] to model the behaviour of a set of objects. MSDs/LSCs are sequence diagrams that, by different modalities assigned to messages and conditions, allow scenarios with liveness (something good must happen) and safety (something bad must not happen) properties to be described precisely. Notably, LSCs and MSDs define how multiple scenarios can be active concurrently and synchronise on common events as well as activate and de-activate MSDs. This allows engineers to flexibly specify systems that fulfill different tasks at the same time.
One key advantage of MSDs/LSCs is that they can be executed with the play-out algorithm, which allows engineers and other stakeholders to understand the behaviour emerging from the interplay of the scenarios [17]. Furthermore, it is possible to analyse whether a set of scenarios can be realised, i.e., it does not contain contradictions or result in deadlocks.
Figure 1 shows a sample MSD used in our experiment. It specifies the communication between a user, a wiper controller as well as the actual wiper actuator. The sequence in the figure describes that (1) a request is sent to the wiper controller to increase the speed, (2) it is checked whether the wiper is in the state active, and (3) the controller sends a message to the actuator to increase the speed by one. If the check in step 2 fails, the MSD will be de-activated and not further executed. Once the first message in an MSD is executed (wiperRequest(WiperRequest::WIPER_INCREASE) in Figure 1), it is called active. After the last message, the MSD is

deactivated again. The numbers on the right side describe the so-called cut, the positions in which an MSD can be.
The complete MSD model consists of a set of five scenarios covering the communication and conditions for the three mentioned objects.

Fig. 1. Sample MSD (diagram not reproduced; it shows the MSD Start_Increase with the lifelines usr: User, wiper: WiperController, and act: WiperActuator, the message wiperRequest(WiperRequest::WIPER_INCREASE), the condition wiper.wiperState == WiperState::WIPER_ACTIVE, the message addToCurrentSpeed(1), and the cut positions 0 and 1 marked on the right)

B. Timed Automata
Timed automata [18] are a state-based formalism which extends finite automata with a set of real-valued variables called clocks as well as various real-time constraints. Several timed automata can be combined into a network of timed automata where different automata synchronise their behaviour by so-called synchronisation channels. Synchronisation channels can be used as a means to specify synchronous message passing. Timed automata can be both simulated as well as verified for correctness using model checking.
Figure 2 shows the timed automaton for the wiper controller covering the increase wiper speed scenario as described previously for the MSD. It defines that if a wiper request for increasing the speed (condition: wiperRequestSignal==WiperRequest_INCREASE) using the synchronisation channel WiperRequest? is received and the wiper is active (condition: Actuator_WiperState==WiperState_ACTIVE), then the wiper speed is increased by 1 and a helper variable is set to 1 (addToCurrentSpeedSignal:=1). This helper variable is used in another automaton for a long-press functionality.
The complete TA model consists of a network of five timed automata covering the communication and conditions for the three mentioned objects.

Fig. 2. Sample TA (diagram not reproduced; recoverable labels: states s1, s2, and error; synchronisations WiperRequest? and AddToCurrentSpeed?; guard wiperRequestSignal==WiperRequest_INCREASE && Actuator_WiperState==WiperState_ACTIVE; guards addToCurrentSpeedSignal==1 and addToCurrentSpeedSignal!=1; update Actuator_WiperSpeed := Actuator_WiperSpeed +1; label false)

IV. EXPERIMENT DESIGN
The evaluation of comprehensibility of the two considered modelling languages used for requirements engineering was performed using a controlled experiment. The goal of this experiment is formulated as follows, using the Goal/Question/Metric paradigm [19]:
• Analyse requirements modelled in two different modelling languages for the purpose of comparison with respect to comprehensibility from the point of view of software developers in the following context: application (verification and validation), subjects (students).
We used a between-subject randomised design with two treatments [20]. The between-subject design was chosen to avoid learning effects. The treatments are the used modelling language, namely MSD and TA. MSDs, a variant of Live Sequence Charts [21], are sequence diagrams with assigned modalities that allow the expression of liveness and safety properties, and real-time constraints. Timed Automata are a modification of Finite Automata for the specification and verification of real-time systems. Hence, both languages are very similar in terms of expressiveness. However, MSDs use a scenario-based description covering multiple objects in one MSD, whereas TA use a state-based description covering a single object in one TA. MSDs were chosen in order to have a sequence-based language with executable semantics, in contrast to UML Sequence Diagrams, and with the possibility to model required and forbidden behaviour. In order to not introduce any bias, we chose TA as a second modelling language, as this language had not been introduced during the course either. Both languages are used without their timing functionality. Subjects were randomly assigned one of the two treatments. In the following subsections, the details of the experiment design are presented.
A. Subjects
We performed the experiment with 22 students from an undergraduate course on software modelling. This is due to availability reasons, as we had a scheduled university course at the end of 2014 in which we could perform the experiment. The students had basic knowledge of UML, as the experiment was performed towards the end of the course. Both modelling languages were only introduced prior to the experiment in a single 45-minute lecture. However, the students were introduced to similar languages earlier in the course, namely to UML sequence diagrams and to UML state machine diagrams.
B. Instrumentation
As a basis for the experimental objects, which we used in the study, we selected requirements from a real-life project within the automotive domain from an industrial partner. The selected requirements describe joint behaviour, i.e. the requirements are not entirely independent. As these requirements are confidential, we abstracted them and changed their actual content so that they resemble a car wiper specification. However, we ensured that the complexity and the logic are comparable. These requirements were then modelled by the main author of this paper using

MSD and TA. The resulting experimental objects consist of two requirements models, S_MSD and S_TA, consisting of five diagrams each. The diagrams specify the activation of a car wiper in slow mode and in fast mode, the increase of the wiper’s speed in two different ways, and the deactivation of the wiper. Additionally, the experimental objects contained a single page describing the context of each treatment. For the MSD specification, this consisted of a UML class diagram and a UML object diagram, and for the TA specification, this page contained the system declarations.
Finally, the instrument contained one page of syntax and semantic explanation in addition to the introduction lecture and a questionnaire. In turn, the questionnaire consisted of a pre-experiment part, collecting demographic data about the subjects (including subjects’ knowledge regarding modelling languages), a post-experiment part, collecting subjective judgment, and the actual measurement questionnaire consisting of 12 questions targeting the subjects’ understanding. The pre- and post-experiment questionnaires were used to judge whether previous experience, understanding of the introduction lectures, or other factors might have affected the dependent variables. Due to space limitations, we only discuss the data obtained from these questionnaires briefly in Section VI. Each of the 12 questions consisted of an initial state of the system and a number of executed messages or commands. Then, 2 sub-questions were asked. The subjects first had to answer whether the execution violated the requirements or not. Additionally, we asked in which state the system was after the execution (or right before the requirements violation), either by asking for the system’s variable values or by asking for the active cuts/states of each diagram. Both sub-questions were awarded with one point each. The second sub-question was only counted if the first sub-question was correct, as it was otherwise already clear that the subject had wrongly executed the requirements. An example question with solutions for both the MSD and the TA model is depicted in Figure 3.

TA model:
Precondition:
  Actuator_WiperSpeed = Constants_SLOW
  Actuator_WiperState = WiperState_ACTIVE
  Wiper_VehicleStatus = VehicleStatus_RUNNING
  Wiper_WiperConfiguration = WiperConfig_INSTALLED
1. WiperRequest is triggered, with wiperRequestSignal set to WiperRequest_OFF
2. SetWiperSpeed is triggered, with setWiperSpeedSignal set to Constants_OFF
Question 1: Does the input scenario violate the specified behaviour?
Answer 1: No , Yes, in step: 1 ☐, 2 ☐
Question 2: Which values do the following variables have
  - after the execution of the input scenario (if A1 is ‘No’)
  - before the violating step (if A1 is ‘Yes’)?
Answer 2:
  Actuator_WiperSpeed = Constants_OFF
  Actuator_WiperState = WiperState_ACTIVE
  Wiper_VehicleStatus = VehicleStatus_RUNNING
  Wiper_WiperConfiguration = WiperConfig_INSTALLED

MSD model:
Precondition:
  act.wiperSpeed = Constants.SPEED_SLOW
  act.wiperState = WiperState::WIPER_ACTIVE
  wiper.vehicleStatus = VehicleStatus::RUNNING
  wiper.configuration = WiperConfig::WIPER_INSTALLED
1. usr sends Message ‘wiperRequest(WiperRequest::WIPER_OFF)’ to wiper
2. wiper sends Message ‘setWiperSpeed(Constants.SPEED_OFF)’ to act
Question 1: Does the input scenario violate the specified behaviour?
Answer 1: No , Yes, in step: 1 ☐, 2 ☐
Question 2: Which values do the following variables have
  - after the execution of the input scenario (if A1 is ‘No’)
  - before the violating step (if A1 is ‘Yes’)?
Answer 2:
  act.wiperSpeed = Constants.SPEED_OFF
  act.wiperState = WiperState::WIPER_ACTIVE
  wiper.vehicleStatus = VehicleStatus::RUNNING
  wiper.configuration = WiperConfig::WIPER_INSTALLED

Fig. 3. Example Question for TA Model (above) and MSD Model (below)

This questionnaire approach has been successfully applied in many similar studies, e.g. in [7], [15], [1]. The instrument, together with the resulting raw data, is published at http://www.grischaliebel.de/data/research/instrument exp msd ta.zip.
C. Variables
There is only a single independent variable in the performed experiment. This is the used visual modelling language with the values MSD or TA. We measured the comprehensibility of the used requirements specification using three dependent variables:
Answered: The number of answered questions.
AScore: The average score achieved per answered question.
Score: The total score achieved for all 12 questions.
Instead of measuring the time, we decided to design the instrument in a way that it would be difficult to answer all questions in the given time frame. Therefore, we use the number of answered questions, Answered, instead of the needed time. We are foremost interested in using modelling languages for verification and validation purposes later on. Therefore, we think that an accurate understanding of a specification is more important than speed. This is why we chose AScore as a metric for measuring how correctly a question is answered on average. For completeness, we also added Score, which is related to the other two metrics by Score = Answered ∗ AScore.
We opted for comprehensibility instead of letting subjects create diagrams themselves, as this is easier and requires less training. Furthermore, the experiment targets models of functional requirements, not simply behavioural models in general. Therefore, we argue that comprehensibility is of particular importance, as the aim of requirements is to document what a system shall fulfill. Hence, correctly understanding these requirements is crucial.
An additional variable which can influence the outcome of the experiment is the subjects’ knowledge regarding modelling languages and their domain knowledge in the automotive domain. While all students are from the same course, they might have different previous knowledge and experience. To address this issue we employed a pre-experiment survey which asked for background information, such as previous courses on modelling taken by the subject.
D. Hypotheses
In the course of the experiment, we used the following null and alternative hypotheses, H0 and H1, which we formulated as follows.
• H0: There is no significant difference between Modal Sequence Diagrams and Timed Automata with respect to comprehensibility of requirements specifications.

• H1: There are significant differences between Modal Sequence Diagrams and Timed Automata with respect to comprehensibility of requirements specifications.
We evaluated the hypotheses separately for each of the dependent variables. Each of the variables was tested for significance using a non-parametric Mann-Whitney U test. Additionally, we tested for equality of variances for each of the variables using a Levene test in order to fulfill the assumptions of the Mann-Whitney U test. For both tests, we used a significance value of 0.05.
E. Operation
The experiment was piloted with two PhD students prior to execution. The instrument turned out to be too complicated and was therefore simplified further to its current form.
The experiment was conducted in a 90-minute lecture. Participation was voluntary and the students received no benefits for the modelling course, such as bonus points or higher grades. In the first 45 minutes, both visual modelling languages were introduced. While this is a rather short time for introducing two new languages, we were limited to this time frame by the course schedule. Additionally, the subjects had previous knowledge in similar languages from the course, so that it was possible to relate the newly introduced languages to that knowledge. Prior to the introduction lecture, we already handed out the experimental objects, so that the subjects knew which treatment they would receive and could concentrate on that language during the lecture. Additionally, they could familiarise themselves with the model. The subjects were encouraged not to share or exchange the objects with each other. After the introduction lecture, we handed out the remaining parts of the instrument, namely the questionnaires and the syntax help. Subjects then received 3 minutes for filling out the pre-experiment questionnaire, 40 minutes to fill out the experiment questionnaire, and finally 2 minutes for the post-experiment questionnaire.
V. VALIDITY
In the following, we discuss the measures we took in order to ensure validity. We use the four aspects of validity as presented in Wohlin et al. [20].
A. Construct Validity
In order to avoid inadequate preoperational explication of constructs, we have explicitly defined what ’comprehensibility’ means with respect to our study. Also, it is clearly defined that a higher score in any of the three dependent variables means a better result for that variable. Our dependent variables do not require any human judgment and are therefore objective. Mono-operation bias can currently not entirely be ruled out, as we only used one experimental object. We are planning to replicate the experiment with another requirements specification in the future in order to address this. Mono-method bias is addressed by asking two sub-questions for each of the 12 experiment questions. While the first of the two sub-questions is a simple yes/no question, it is an additional check whether the subject has correctly understood the model. If this one is already incorrect, we automatically awarded 0 points to the second sub-question as well. Additionally, the second sub-question was much harder to get right by chance.
B. Internal Validity
In order to avoid maturation or learning effects, subjects were only allowed to participate in the experiment once and only in one group, and were not allowed to exchange information with other subjects during the experiment. Additionally, we used a pre-experiment questionnaire in order to assess the subjects’ domain and modelling knowledge, which might affect the outcome. While all students came from the same course, they had different previous experience with respect to software modelling and requirements engineering. We also assured that the subjects voluntarily participated in the experiment, by not giving rewards in the form of improved course grades or similar, in order to avoid compensation rivalry or demoralisation. However, we cannot entirely rule out that some subjects participated to win our approval later in the course. The fact that we used volunteers might bias the results, as they could have been more motivated than the average.
C. External Validity
We used parts of a real-life specification instead of a toy example for the experiment instrument. However, the requirements had to be abstracted as the original specification is confidential. Additionally, while modelling the requirements, we had to ensure that both treatments were modelled in the same way and exhibited the same behaviour. This could have led to one of the treatments being modelled in a way which would not happen in practice, and thus limit generalisability. We tried to reduce this threat by iteratively discussing and improving the instrument among the authors of this paper. Additionally, the fact that we used student subjects possibly limits the generalisability to an industrial context. Finally, the specification is based on an automotive requirements specification, which can limit the generalisability to other domains.
D. Conclusion Validity
We tried to avoid ambiguous wording of questions in the questionnaire by iteratively reviewing and improving it. Additionally, we performed a pilot experiment with two PhD students prior to the actual experiment, in order to improve both the introduction material and the questionnaire. Reliability of treatment implementation is given, as the introduction lecture was only given once for the actual experiment. We only performed statistical tests on the three dependent variables, which were defined up-front, and did not fish for results [20].
VI. RESULTS AND DISCUSSION
In the following, we will first discuss the demography of the subjects participating in the experiment. Afterwards, we present and discuss the results of the hypothesis testing for the experiment. Finally, we finish with a discussion of the post-experiment questionnaire.
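The significance testing described in Section IV-D can be illustrated with a minimal sketch of a two-sided Mann-Whitney U test. The sketch is not the authors' analysis script: it assumes the normal approximation with tie correction (no continuity correction), the function names `midranks` and `mann_whitney_u` are hypothetical, and the two sample lists are made-up placeholder values rather than the experiment's raw data. The Levene test is not covered here.

```python
import math


def midranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns the smaller U statistic and the approximate p-value.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    combined = list(x) + list(y)
    ranks = midranks(combined)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    # Tie correction term: sum of t^3 - t over all groups of tied values.
    counts = {}
    for v in combined:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    sigma = math.sqrt(n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1))))
    if sigma == 0:
        return u, 1.0  # all observations identical: no evidence of a difference
    z = (u - n1 * n2 / 2) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))
    return u, p


# Hypothetical numbers of answered questions for two groups (NOT the paper's data).
group_a = [2, 3, 3, 5, 6, 12]
group_b = [4, 8, 10, 12, 12, 12]
u_stat, p_value = mann_whitney_u(group_a, group_b)
reject_h0 = p_value < 0.05  # significance value used in the paper
```

For samples as small as the ones in this experiment, an exact U distribution may be preferable to the normal approximation; the sketch only shows the mechanics of ranking and comparing the two treatments.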

A. Demographic Data
Out of the 22 subjects, 19 are Bachelor students and 3 are Master students. This can be explained through the fact that the course in which we performed the experiment is on Bachelor level, but can be taken as an elective course by first year Master students. All 3 Master students were randomly assigned the MSD treatment. Out of 22 subjects, 13 have a secondary school degree, 7 a Bachelor degree, 1 a Master degree, and 1 subject another degree as their highest degree. This means that 5 subjects on Bachelor level are already in possession of a Bachelor degree, and one Master student already has a Master degree. While this is certainly possible, it might also be caused by misunderstanding the question. Most subjects already had previous courses on related topics, such as Object-oriented programming or Software Architecture. Only 6 subjects stated to not have taken any related courses previously. Additionally, we asked the subjects for their professional experience in developing software, in modelling software, and in requirements engineering. In both modelling software and in requirements engineering, only 3 subjects answered that they had previous professional experience, ranging from half a year to three years of experience. In addition to this, 9 subjects stated that they have professional experience in software development, with one subject each stating 0.3 years, 1 year, and 8 years of experience, and 3 subjects each stating 2 and 3 years of experience.
B. Experiment Results
The experiment was conducted on 4th December 2014 at Chalmers University in Gothenburg, Sweden. The answers from the paper questionnaire were afterwards digitalised in order to allow computerised data processing. An overview over both the descriptive statistics and the significance testing for all three variables is depicted in Tables I and II.

TABLE I
DESCRIPTIVE STATISTICS OF THE EXPERIMENT
Treatment   Dependent variable   Mean    Standard deviation
TA          Answered             5.667   3.42
TA          AScore               0.693   0.726
TA          Score                5.583   7.669
MSD         Answered             9.4     3.273
MSD         AScore               0.576   0.45
MSD         Score                6.2     5.453

TABLE II
SIGNIFICANCE TESTING OF THE EXPERIMENT
Dependent variable   Significance level Levene   Significance level Mann-Whitney U   H0 rejected
Answered             p ≈ 0.94                    p ≈ 0.021                           Yes
AScore               p ≈ 0.097                   p ≈ 0.947                           No
Score                p ≈ 0.707                   p ≈ 0.464                           No

The results of the first dependent variable, Answered, are depicted in Figure 4 for both treatments. Clearly, subjects in the TA group took longer to answer the questionnaire, which led to only one subject finishing all questions. In the MSD group, half of the subjects finished all questions. Additionally, four subjects in the TA group answered three or fewer questions, whereas this is only the case for one subject in the MSD group. The large difference in the two means for this variable already indicates that the null hypothesis can be rejected, which is confirmed by the significance test with p ≈ 0.021. Hence, there is a significant difference with respect to the number of answered questions between the two treatments. A possible explanation for this might be the nature of MSDs, compared to TAs. While a single MSD has to be taken into account only once it is activated, each automaton in a TA is ’active’ by definition. This means that for each message in a given scenario, all automata need to be studied, while only a subset of the MSDs needs to be considered.

Fig. 4. Answered of TA and MSD treatment (charts not reproduced; two panels, Answered (TA) for subjects 1 to 12 and Answered (MSD) for subjects 1 to 10, each with a vertical axis from 0 to 12)

The second dependent variable, AScore, is depicted in Figure 5 for both TA and MSD treatment. Here, in the TA treatment there is a much larger variance in the data set, with both very high and very low values. For the MSD treatment, there are few values in the extremes. The statistical test results in p ≈ 0.947, so that the null hypothesis cannot be rejected for this variable. We do not have an explanation for the large differences between subjects in the TA treatment, but they might be attributed to misunderstandings with respect to the modelling language. Several subjects achieved average scores under 1 point, even though they stated in the post-experiment questionnaire that they were confident in their answers. We plan to replicate the experiment in the future which will include some simple upfront questions in order to measure whether the subjects have really understood the languages well enough

22
and analyse whether this correlates with the self-assessment. μ
[Figures 5 and 6: per-subject bar charts; the plots themselves are not recoverable from the
extracted text. Horizontal axis: Subject (1 to 12 for TA, 1 to 10 for MSD). Vertical axes:
Score (0 to 24) and AScore (0 to 2). Panel statistics: Score (TA): μ = 5.583, σ² = 58.81;
Score (MSD): μ = 6.2, σ² = 29.73; AScore (TA): μ = 0.693, σ² = 0.526; AScore (MSD): μ = 0.576,
σ² = 0.202.]
Fig. 6. Score of TA and MSD treatment
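As a hedged cross-check, the following Python sketch recomputes the Score statistics of Table I
and the Mann-Whitney U result for Score in Table II. The per-subject values are an assumption:
they are read from the bar labels of Figure 6, and their assignment to subject numbers is not
recoverable (it is also not needed here). The test uses the normal approximation with tie and
continuity correction; the paper does not state which variant or tool was used, so this is one
plausible reconstruction and not the authors' procedure.

```python
import math
from statistics import mean, stdev, variance

# Assumption: Score values read from the bar labels of Figure 6.
# Which value belongs to which subject number is unknown.
score_ta = [21, 20, 9, 9, 3, 2, 2, 1, 0, 0, 0, 0]   # 12 subjects
score_msd = [16, 14, 9, 7, 6, 5, 2, 2, 1, 0]        # 10 subjects


def average_ranks(values):
    """Return 1-based ranks in input order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the ranks start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie and continuity correction)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = list(x) + list(y)
    ranks = average_ranks(pooled)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    # Tie correction: sum of t^3 - t over all groups of tied values.
    tie_term = sum(t ** 3 - t for t in (pooled.count(v) for v in set(pooled)))
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = max(0.0, abs(u - n1 * n2 / 2) - 0.5) / math.sqrt(var_u)
    p = math.erfc(z / math.sqrt(2))  # two-sided p-value of the standard normal
    return u, p


for name, data in (("TA", score_ta), ("MSD", score_msd)):
    print(name, round(mean(data), 3), round(variance(data), 2), round(stdev(data), 3))

u, p = mann_whitney_u(score_ta, score_msd)
print("U =", u, "p ≈", round(p, 3))
```

Under these assumptions the sketch yields means of 5.583 and 6.2, sample variances of 58.81 and
29.73, and p ≈ 0.464 for Score, in line with Tables I and II.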

Fig. 5. AScore of TA and MSD treatments

As the third variable, Score, is directly computed from AScore and Answered, it exhibits a
similar pattern (see Figure 6). In the TA group, two subjects achieved 20 or more points, close
to the maximum of 24. However, many subjects in this group have low total scores. As subjects in
the MSD group have significantly higher values in the Answered metric, their average Score
values are higher, even though the average score AScore is lower for this group.
In summary, we can state that MSDs are significantly quicker to comprehend. Therefore, if speed
is a relevant factor, MSDs should be chosen instead of TAs. One could argue that speed itself is
not relevant, as long as AScore is low. Therefore, we plan to replicate the experiment with
subjects who are more familiar with the modelling languages, in order to see whether the
difference in speed is still present.

C. Correlation between Demographic Data and Dependent Variables
We used the Pearson product-moment correlation coefficient to assess the correlations between
the three dependent variables and the number of related courses previously taken by students,
the education level (Bachelor/Master), and the subject’s confidence in their answers. The
resulting values for Pearson’s r and the p-value are depicted in Table III. Treating r < 0.3 as
a small effect size, 0.3 ≤ r < 0.4 as medium, and r ≥ 0.4 as large, we see that there is a large
correlation between all three dependent variables and the subject’s confidence in their results.
This result indicates that subjects had, on average, a clear grasp of whether they understood
the instrument or not. Interestingly, both the number of previous courses and the education
level only show a small correlation with the dependent variables. Similarly, previous experience
in Software Development, Software Modelling, and Requirements Engineering has a small
correlation with the dependent variables, as depicted in Table IV. These results could indicate
that the dependent variables were in fact influenced by other factors, such as confusion
regarding the newly introduced modelling languages. However, they could also indicate that the
understanding of the requirements is not dependent on previous education and experience. Further
replications will be necessary in order to answer these questions in a satisfactory manner.

TABLE III
CORRELATIONS BETWEEN DEPENDENT VARIABLES AND DEMOGRAPHICS

Dependent   Previous Courses   Bachelor/Master   Confidence
variable    r       p          r        p        r       p
Answered    0.207   0.355      −0.247   0.267    0.56    0.007
AScore      0.122   0.588      0.116    0.606    0.513   0.015
Score       0.17    0.45       0.074    0.745    0.557   0.007

TABLE IV
CORRELATIONS BETWEEN DEPENDENT VARIABLES AND EXPERIENCE

Dependent   Software Dev.      Modelling         Req. Eng.
variable    r        p         r       p         r       p
Answered    0.08     0.724     0.054   0.811     0.092   0.685
AScore      −0.044   0.846     0.053   0.816     0.063   0.781
Score       −0.09    0.692     0.028   0.901     0.039   0.863

VII. CONCLUSIONS AND FUTURE WORK
In this paper, we have presented the results of a controlled experiment with 22 students in an
undergraduate course on software modelling. We studied the comprehensibility of functional
requirements modelled in two graphical languages, Modal Sequence Diagrams, a sequence-based
notation, and Timed Automata, a state-based notation. Subjects received a model in one of the
two languages and a questionnaire with questions testing their understanding of the model. While
we cannot reject the null hypothesis of no significant difference between the two treatments for
either the average or the total questionnaire scores, subjects receiving the Modal Sequence
Diagram specification answered significantly more questions. This indicates that if speed or
efficiency plays an important role, scenario-based models should be considered instead of
state-based models. However, further studies need to be conducted in order to understand whether
this effect persists with more experienced users who achieve higher overall scores.
While our sample of students without previous knowledge of the treatments used can be seen as a
possible threat to validity, this lack of experience is in fact a realistic setup for industrial
use in the automotive domain. As requirements specifications are used across organisations and
across roles within an organisation, it cannot be assumed that the receiver of a specification
is always familiar with every detail of the language used. Additionally, receivers are often not
experts in modelling, but in other areas such as requirements engineering or system design.
Therefore, in contrast to, for example, software development, the receivers of a requirements
specification cannot be expected to be experts in the language used. Additionally, our results
indicate that the current practice, choosing the modelling language based on convenience, is not
a threat to the comprehension of the specifications in itself.
In the future, we will replicate the experiment both with different groups of students and with
professionals from our industrial partners in order to eliminate possible bias and to assess
whether experience and a deeper knowledge of the languages can have a significant impact on the
understanding. Additionally, we will aim at generating a theory on which languages are suitable
for which kind of task or system when modelling requirements.

ACKNOWLEDGEMENT
We would like to express our gratitude to Nadja Marko and Christian Webel, who helped in
reviewing and discussing an early experiment design. Additionally, we would like to thank Pariya
Kashfi and Vard Antinyan for participating in the pilot experiment. The research leading to
these results has received partial funding from the European Union’s Seventh Framework Program
(FP7/2007-2013) for CRYSTAL-Critical System Engineering Acceleration Joint Undertaking under
grant agreement No 332830 and from Vinnova under DIARIENR 2012-04304.

REFERENCES
[1] M. C. Otero and J. J. Dolado, “Evaluation of the comprehension of the dynamic modeling in
UML,” Information and Software Technology, vol. 46, no. 1, pp. 35–53, 2004.
[2] C. Glezer, M. Last, E. Nachmany, and P. Shoval, “Quality and comprehension of UML
interaction diagrams-an experimental comparison,” Information and Software Technology, vol. 47,
no. 10, pp. 675–692, 2005.
[3] A. Nugroho, “Level of detail in UML models and its impact on model comprehension: A
controlled experiment,” Information and Software Technology, vol. 51, no. 12, pp. 1670–1685,
2009.
[4] M. Staron, L. Kuzniarz, and C. Wohlin, “Empirical assessment of using stereotypes to improve
comprehension of UML models: A set of experiments,” Journal of Systems and Software, vol. 79,
no. 5, pp. 727–742, 2006.
[5] N. Condori-Fernández, M. Daneva, K. Sikkel, R. Wieringa, O. Dieste, and O. Pastor, “A
systematic mapping study on empirical evaluation of software requirements specifications
techniques,” in Proceedings of the 2009 3rd International Symposium on Empirical Software
Engineering and Measurement. IEEE Computer Society, 2009, pp. 502–505.
[6] N. Condori-Fernández, M. Daneva, K. Sikkel, and A. Herrmann, “Practical relevance of
experiments in comprehensibility of requirements specifications,” in Empirical Requirements
Engineering (EmpiRE), 2011 First International Workshop on, Aug 2011, pp. 21–28.
[7] S. Abrahão, C. Gravino, E. Insfran, G. Scanniello, and G. Tortora, “Assessing the
effectiveness of sequence diagrams in the comprehension of functional requirements: Results from
a family of five experiments,” Software Engineering, IEEE Transactions on, vol. 39, no. 3,
pp. 327–342, March 2013.
[8] D. Harel and S. Maoz, “Assert and negate revisited: Modal semantics for UML sequence
diagrams,” Software and Systems Modeling (SoSyM), vol. 7, no. 2, pp. 237–252, May 2008.
[9] R. Alur and D. L. Dill, “A Theory of Timed Automata,” Theoretical Computer Science,
vol. 126, no. 2, pp. 183–235, 1994.
[10] K. G. Larsen, M. Mikucionis, B. Nielsen, and A. Skou, “Testing real-time embedded software
using UPPAAL-TRON: An industrial case study,” in Proceedings of the 5th ACM International
Conference on Embedded Software. ACM, 2005.
[11] A. Fehnker, “Scheduling a steel plant with timed automata,” in RTCSA. IEEE, 1999, p. 280.
[12] J. Greenyer, M. Haase, J. Marhenke, and R. Bellmer, “Evaluating a formal scenario-based
method for the requirements analysis in automotive software engineering,” in Proceedings of the
2015 10th Joint Meeting on Foundations of Software Engineering. ACM, 2015.
[13] Object Management Group, “Unified modeling language,” http://www.uml.org/, Jun. 2014.
[14] E. Kamsties, A. von Knethen, and R. Reussner, “A controlled experiment to evaluate how
styles affect the understandability of requirements specifications,” Information and Software
Technology, vol. 45, no. 14, pp. 955–965, 2003, Eighth International Workshop on Requirements
Engineering: Foundation for Software Quality.
[15] G. Scanniello, M. Staron, H. Burden, and R. Heldal, “On the effect of using SysML
requirement diagrams to comprehend requirements: Results from two controlled experiments,” in
18th International Conference on Evaluation Assessment in Software Engineering (EASE), May 2014,
pp. 433–442.
[16] W. Damm and D. Harel, “LSCs: Breathing life into message sequence charts,” in Formal
Methods in System Design, vol. 19. Kluwer Academic, 2001, pp. 45–80.
[17] D. Harel and R. Marelly, Come, Let’s Play: Scenario-Based Programming Using LSCs and the
Play-Engine. Springer, August 2003.
[18] J. Bengtsson and W. Yi, “Timed automata: Semantics, algorithms and tools,” in Lectures on
Concurrency and Petri Nets, vol. 3098. Springer, 2003, pp. 87–124.
[19] V. R. Basili, “Software modeling and measurement: The goal/question/metric paradigm,”
Tech. Rep., 1992.
[20] C. Wohlin, P. Runeson, M. Höst, M. C. Ohlsson, and B. Regnell, Experimentation in Software
Engineering. Springer, 2012.
[21] W. Damm and D. Harel, “LSCs: Breathing life into message sequence charts,” in Formal
Methods in System Design, vol. 19. Kluwer Academic, 2001, pp. 45–80.