Comparing Comprehensibility of Modelling
Languages for Specifying Behavioural Requirements
Grischa Liebel
Software Engineering Division
Chalmers | University of Gothenburg, Sweden
grischa@chalmers.se

Matthias Tichy
Ulm University, Germany
matthias.tichy@uni-ulm.de


Abstract: The selection of a suitable modelling language influences the success of software modelling. Several experiments comparing the comprehensibility of graphical modelling languages have been published. However, no published study comparing the comprehensibility of functional requirements modelled in different graphical modelling languages exists. This paper evaluates how two requirements modelled in a sequence-based notation, Modal Sequence Diagrams, and in a state-based notation, Timed Automata, compare with respect to comprehensibility. A controlled experiment with 22 students from an undergraduate course on software modelling was performed. Our results show no significant differences with respect to the comprehensibility of the two different languages, but subjects who answered the questionnaire for the sequence-based notation completed significantly more answers in the given time limit. These initial results indicate that choosing a modelling language for requirements modelling based on convenience does not significantly affect the understanding of the resulting requirements.

I. INTRODUCTION

Choosing a visual modelling language in practice is typically dependent on previous experience with the modelling language or the availability of the respective modelling tools. While both aspects are certainly reasonable, other criteria are similarly important.

Comprehensibility is a commonly evaluated criterion of visual languages, e.g., for UML diagrams in [1], [2], [3], [4]. However, evaluation results can be contradictory, as in [1] and [2]. Similarly to visual languages, the comprehensibility or understandability of software requirements specifications is the most common aspect evaluated in empirical requirements engineering studies [5]. At the same time, their practical value is often questioned [6]. A recent family of experiments by Abrahão et al. reports that providing sequence diagrams together with a natural language specification increases comprehensibility [7]. Additionally, the authors state that possible future work in this area could be “experiments to analyse the effect of different behavioural diagrams in the comprehension of software models”.

As a first step in this direction, we conducted a controlled experiment in order to understand which behavioural diagrams outperform others with respect to comprehensibility for the specific case of modelling functional requirements. Specifically, we compared two modelling languages, Modal Sequence Diagrams (MSDs) [8] and Timed Automata (TA) [9]. We chose these two languages as they are representatives of sequence-based notations (MSDs) and state-based notations (TAs), as they are similar in terms of expressiveness and hence possible alternatives for expressing the same requirements, and as both have been applied in industrial case studies, e.g. [10], [11], [12]. Additionally, both languages were already used in a joint project with industrial partners. The experiment is based on an extensive and detailed requirements specification by a vehicle manufacturer that defines the behaviour of a software function to be realised by a supplier. Hence, the requirements specification is quite detailed and as such a reasonable candidate for modelling. We used 22 undergraduate students in a course on software modelling as subjects. Our results show no significant difference with respect to the comprehensibility of requirements modelled in the two languages, but requirements modelled in MSDs are significantly quicker to understand. This indicates that the current practice, selecting visual languages based on convenience, is in fact feasible with respect to the comprehension of the resulting requirements specifications. However, it might take longer to understand the requirements depending on the chosen language.

The remainder of this paper is structured as follows. In Section II, related literature is discussed. Section III covers the basics of the two visual modelling languages we used in the experiment. Section IV describes the experiment design, followed by a discussion of validity threats in Section V. Section VI presents the actual results and discusses them in depth. The paper is concluded in Section VII.

II. RELATED WORK

In the context of the UML [13], a number of experimental studies have been published that compare different modelling languages with respect to comprehensibility. The comprehensibility of UML behavioural diagrams, namely sequence, collaboration, and state machine diagrams, in both real-time and management information systems is compared using a controlled experiment by Otero and Dolado [1]. The results show that sequence diagrams are more comprehensible for real-time systems than for management information systems. With respect to the answering speed, their data shows that sequence diagrams perform better than collaboration and state machine diagrams for both domains. As subjects, 31 undergraduate students are used in their study.


In contrast to this, Glezer et al. report that sequence diagrams are more comprehensible for management information systems than for real-time systems [2]. The authors mainly attribute this difference to the previous knowledge of the subjects, who were not experienced in real-time systems. In this study, the 76 student subjects performed the experiment as part of a mandatory mid-term exam.

Nugroho investigates the impact of detail on the comprehension of UML Class, Sequence, Package, and Use Case diagrams in the form of a controlled experiment with 53 graduate students [3]. The author reports that a low level of detail can lead to misinterpretations and that the subjects’ knowledge did not have an impact on the comprehension.

Staron et al. report the results from four controlled experiments studying the impact of using UML stereotypes on comprehensibility conducted with 68 students and 4 professionals in total [4]. The studies show that stereotypes indeed improve the comprehensibility and the total and relative times for answering the used questionnaires.

Similar to the UML, comprehensibility is a commonly studied characteristic of requirements specifications. Condori-Fernández et al. present an evaluation of empirical studies until 2008 on requirements comprehensibility [6]. The authors conclude that while comprehensibility studies are common, many of them have practical limitations, such as using made-up examples instead of real specifications.

Kamsties et al. study how different specification techniques affect the comprehensibility of a software requirements specification, using a re-engineered specification of a bicycle computer [14]. The authors report that black-box specification techniques, describing a system by its externally visible behaviour, lead to a faster and more correct answering of the used instrument than white-box specification techniques, where the system is described by the behaviour between its entities.

Finally, there are a number of studies which investigate the comprehensibility of requirements modelled in or enhanced with visual modelling languages. Scanniello et al. study the effect on requirements comprehensibility when using SysML diagrams in addition to natural language, compared to only natural language requirements [15]. The authors use students as subjects in two controlled experiments and report that comprehensibility is increased when SysML diagrams are provided, whereas completion time for the comprehension task is unaffected.

A recent paper by Abrahão et al. reports a family of five experiments on the comprehensibility of functional requirements modelled with sequence diagrams in addition to the natural language specification [7]. Of these, one experiment uses undergraduate students, two experiments use master's students, one experiment uses doctoral students and one experiment uses professionals as subjects. Four out of five experiments show statistically significant support for improved comprehensibility when using sequence diagrams.

In summary, a number of experiments exist that investigate the comprehensibility of visual modelling languages, of requirements specifications, and of requirements represented in visual modelling languages. However, the outcomes vary and are sometimes even contradictory, e.g. in [1] and [2].

Additionally, we are not aware of any experiment comparing requirements represented by behavioural models only. This is a gap in knowledge, as requirements are typically on a more abstract level than, for example, software design and are, additionally, often intended to be read and understood by non-experts. Particularly in the automotive domain, it is the usual process that detailed requirements specifications covering the behaviour of software components are defined by the vehicle manufacturer and subsequently sent to the supplier, who needs to correctly understand and realise the specified behaviour. We are filling this gap with our contribution in this paper.

III. BACKGROUND

In the following, we introduce the two compared modelling languages and illustrate them using sample models from our experiment. The models specify the behaviour for the case that a user wants to increase the speed of a wiper by one unit. Since both languages basically employ the same modelling notations to specify real-time aspects, we did not use any real-time aspects but instead focused on non-real-time behaviour. Furthermore, a pilot of the experiment showed that the real-time aspects were too difficult to understand for the planned experiment. We plan a future experiment specifically targeting the real-time aspects.

A. Modal Sequence Diagrams

Modal Sequence Diagrams (MSDs) [8] are a recent variant of Live Sequence Charts (LSCs) [16] to model the behaviour of a set of objects. MSDs/LSCs are sequence diagrams that, by different modalities assigned to messages and conditions, allow scenarios with liveness (something good must happen) and safety (something bad must not happen) properties to be described precisely. Notably, LSCs and MSDs define how multiple scenarios can be active concurrently and synchronise on common events as well as activate and de-activate MSDs. This allows engineers to flexibly specify systems that fulfill different tasks at the same time.

One key advantage of MSDs/LSCs is that they can be executed with the play-out algorithm, which allows engineers and other stakeholders to understand the behaviour emerging from the interplay of the scenarios [17]. Furthermore, it is possible to analyse whether a set of scenarios can be realised, i.e., whether it contains no contradictions and does not result in deadlocks.

Figure 1 shows a sample MSD used in our experiment. It specifies the communication between a user, a wiper controller as well as the actual wiper actuator. The sequence in the figure describes that (1) a request is sent to the wiper controller to increase the speed, (2) it is checked whether the wiper is in the state active, and (3) the controller sends a message to the actuator to increase the speed by one. If the check in step 2 fails, the MSD will be de-activated and not further executed. Once the first message in an MSD is executed (wiperRequest(WiperRequest::WIPER_INCREASE) in Figure 1), it is called active. After the last message, the MSD is
deactivated again. The numbers on the right side describe the so-called cut, the positions in which an MSD can be.

The complete MSD model consists of a set of five scenarios covering the communication and conditions for the three mentioned objects.

B. Timed Automata

Timed automata [18] are a state-based formalism which extends finite automata with a set of real-valued variables called clocks as well as various real-time constraints. Several timed automata can be combined into a network of timed automata where different automata synchronise their behaviour via so-called synchronisation channels. Synchronisation channels can be used as a means to specify synchronous message passing. Timed automata can be both simulated and verified for correctness using model checking.

Figure 2 shows the timed automaton for the wiper controller covering the increase wiper speed scenario as described previously for the MSD. It defines that if a wiper request for increasing the speed (condition: wiperRequestSignal==WiperRequest_INCREASE) using the synchronisation channel WiperRequest? is received and the wiper is active (condition: Actuator_WiperState==WiperState_ACTIVE), then the wiper speed is increased by 1 and a helper variable is set to 1 (addToCurrentSpeedSignal:=1). This helper variable is used in another automaton for a long-press functionality.

The complete TA model consists of a network of five timed automata covering the communication and conditions for the three mentioned objects.

[Fig. 1. Sample MSD. Diagram "MSD Start_Increase" with the lifelines usr: User, wiper: WiperController and act: WiperActuator; message wiperRequest(WiperRequest::WIPER_INCREASE); condition wiper.wiperState == WiperState::WIPER_ACTIVE; message addToCurrentSpeed(1); cut positions 0 and 1 marked on the right side.]

[Fig. 2. Sample TA. Locations s1 and s2 and the labels error and false; edge labels: WiperRequest? with guard wiperRequestSignal==WiperRequest_INCREASE && Actuator_WiperState==WiperState_ACTIVE; AddToCurrentSpeed? with guard addToCurrentSpeedSignal==1 and update Actuator_WiperSpeed := Actuator_WiperSpeed + 1; AddToCurrentSpeed? with guard addToCurrentSpeedSignal!=1.]

IV. EXPERIMENT DESIGN

The evaluation of comprehensibility of the two considered modelling languages used for requirements engineering was performed using a controlled experiment. The goal of this experiment is formulated as follows, using the Goal/Question/Metric paradigm [19]:

• Analyse requirements modelled in two different modelling languages for the purpose of comparison with respect to comprehensibility from the point of view of software developers in the following context: application (verification and validation), subjects (students).

We used a between-subject randomised design with two treatments [20]. The between-subject design was chosen to avoid learning effects. The treatments are the used modelling language, namely MSD and TA. MSDs, a variant of Live Sequence Charts [21], are sequence diagrams with assigned modalities that allow the expression of liveness and safety properties, and real-time constraints. Timed Automata are a modification of Finite Automata for the specification and verification of real-time systems. Hence, both languages are very similar in terms of expressiveness. However, MSDs use a scenario-based description covering multiple objects in one MSD, whereas TA use a state-based description covering a single object in one TA. MSDs were chosen in order to have a sequence-based language with executable semantics, in contrast to UML Sequence Diagrams, and with the possibility to model required and forbidden behaviour. In order not to introduce any bias, we chose TA as a second modelling language, as this language had likewise not been introduced during the course. Both languages are used without their timing functionality. Subjects were randomly assigned to one of the two treatments. In the following subsections, the details of the experiment design are presented.

A. Subjects

We performed the experiment with 22 students from an undergraduate course on software modelling. This is due to availability reasons, as we had a scheduled university course at the end of 2014 in which we could perform the experiment. The students had basic knowledge of UML, as the experiment was performed towards the end of the course. Both modelling languages were only introduced prior to the experiment in a single 45-minute lecture. However, the students were introduced to similar languages earlier in the course, namely to UML sequence diagrams and to UML state machine diagrams.

B. Instrumentation

As a basis for the experimental objects, which we used in the study, we selected requirements from a real-life project within the automotive domain from an industrial partner. The selected requirements describe joint behaviour, i.e. the requirements are not entirely independent. As these requirements are confidential, we abstracted them and changed their actual content to resemble a car wiper specification. However, we ensured that the complexity and the logic are comparable. These requirements were then modelled by the main author of this paper using
MSD and TA. The resulting experimental objects consist of two requirements models, S_MSD and S_TA, consisting of five diagrams each. The diagrams specify the activation of a car wiper in slow mode and in fast mode, the increase of the wiper’s speed in two different ways, and the deactivation of the wiper. Additionally, the experimental objects contained a single page describing the context of each treatment. For the MSD specification, this consisted of a UML class diagram and a UML object diagram, and for the TA specification, this page contained the system declarations.

Finally, the instrument contained one page of syntax and semantic explanation in addition to the introduction lecture and a questionnaire. In turn, the questionnaire consisted of a pre-experiment part, collecting demographic data about the subjects (including subjects’ knowledge regarding modelling languages), a post-experiment part, collecting subjective judgment, and the actual measurement questionnaire consisting of 12 questions targeting the subjects’ understanding. The pre- and post-experiment questionnaires were used to judge whether previous experience, understanding of the introduction lectures, or other factors might have affected the dependent variables. Due to space limitations, we only discuss the data obtained from these questionnaires briefly in Section VI. Each of the 12 questions consisted of an initial state of the system and a number of executed messages or commands. Then, 2 sub-questions were asked. The subjects first had to answer whether the execution violated the requirements or not. Additionally, we asked in which state the system was after the execution (or right before the requirements violation), either by asking for the system’s variable values or by asking for the active cuts/states of each diagram. Both sub-questions were worth one point each. The second sub-question was only counted if the first sub-question was correct, as it was otherwise already clear that the subject had wrongly executed the requirements. An example question with solutions for both the MSD and the TA model is depicted in Figure 3.

Fig. 3. Example Question for TA Model (above) and MSD Model (below)

TA model:
  Precondition:
    Actuator_WiperSpeed      = Constants_SLOW
    Actuator_WiperState      = WiperState_ACTIVE
    Wiper_VehicleStatus      = VehicleStatus_RUNNING
    Wiper_WiperConfiguration = WiperConfig_INSTALLED
  1. WiperRequest is triggered, with wiperRequestSignal set to WiperRequest_OFF
  2. SetWiperSpeed is triggered, with setWiperSpeedSignal set to Constants_OFF
  Question 1: Does the input scenario violate the specified behaviour?
  Answer 1:   No , Yes, in step: 1 ☐, 2 ☐
  Question 2: Which values do the following variables have
              - after the execution of the input scenario (if A1 is ‘No’)
              - before the violating step (if A1 is ‘Yes’)?
  Answer 2:
    Actuator_WiperSpeed      = Constants_OFF
    Actuator_WiperState      = WiperState_ACTIVE
    Wiper_VehicleStatus      = VehicleStatus_RUNNING
    Wiper_WiperConfiguration = WiperConfig_INSTALLED

MSD model:
  Precondition:
    act.wiperSpeed      = Constants.SPEED_SLOW
    act.wiperState      = WiperState::WIPER_ACTIVE
    wiper.vehicleStatus = VehicleStatus::RUNNING
    wiper.configuration = WiperConfig::WIPER_INSTALLED
  1. usr sends Message ‘wiperRequest(WiperRequest::WIPER_OFF)’ to wiper
  2. wiper sends Message ‘setWiperSpeed(Constants.SPEED_OFF)’ to act
  Question 1: Does the input scenario violate the specified behaviour?
  Answer 1:   No , Yes, in step: 1 ☐, 2 ☐
  Question 2: Which values do the following variables have
              - after the execution of the input scenario (if A1 is ‘No’)
              - before the violating step (if A1 is ‘Yes’)?
  Answer 2:
    act.wiperSpeed      = Constants.SPEED_OFF
    act.wiperState      = WiperState::WIPER_ACTIVE
    wiper.vehicleStatus = VehicleStatus::RUNNING
    wiper.configuration = WiperConfig::WIPER_INSTALLED

This questionnaire approach has been successfully applied in many similar studies, e.g. in [7], [15], [1]. The instrument, together with the resulting raw data, is published at http://www.grischaliebel.de/data/research/instrument_exp_msd_ta.zip.

C. Variables

There is only a single independent variable in the performed experiment. This is the used visual modelling language with the values MSD or TA. We measured the comprehensibility of the used requirements specification using three dependent variables:

Answered: The number of answered questions.
AScore: The average score achieved per answered question.
Score: The total score achieved for all 12 questions.

Instead of measuring the time, we decided to design the instrument in a way that it would be difficult to answer all questions in the given time frame. Therefore, we use the number of answered questions, Answered, instead of the needed time. We are foremost interested in using modelling languages for verification and validation purposes later on. Therefore, we think that an accurate understanding of a specification is more important than speed. This is why we chose AScore as a metric for measuring how correctly a question is answered on average. For completeness, we also added Score, which is related to the other two metrics by Score = Answered × AScore. We opted for comprehensibility instead of letting subjects create diagrams themselves, as this is easier and requires less training. Furthermore, the experiment targets models of functional requirements, not simply behavioural models in general. Therefore, we argue that comprehensibility is of particular importance, as the aim of requirements is to document what a system shall fulfill. Hence, correctly understanding these requirements is crucial.

An additional variable which can influence the outcome of the experiment is the subjects’ knowledge regarding modelling languages and their domain knowledge in the automotive domain. While all students are from the same course, they might have different previous knowledge and experience. To address this issue, we employed a pre-experiment survey which asked for background information, such as previous courses on modelling taken by the subject.

D. Hypotheses

In the course of the experiment, we used the following null and alternative hypotheses, H0 and H1, which we formulated as follows.

• H0: There is no significant difference between Modal Sequence Diagrams and Timed Automata with respect to comprehensibility of requirements specifications.
• H1: There are significant differences between Modal Sequence Diagrams and Timed Automata with respect to comprehensibility of requirements specifications.

We evaluated the hypotheses separately for each of the dependent variables. Each of the variables was tested for significance using a non-parametric Mann-Whitney U test. Additionally, we tested for equality of variances for each of the variables using a Levene test in order to fulfill the assumptions of the Mann-Whitney U test. For both tests, we used a significance level of 0.05.

E. Operation

The experiment was piloted with two PhD students prior to execution. The instrument turned out to be too complicated and was therefore simplified further to its current form.

The experiment was conducted in a 90-minute lecture. Participation was voluntary and the students received no benefits for the modelling course, such as bonus points or higher grades. In the first 45 minutes, both visual modelling languages were introduced. While this is a rather short time for introducing two new languages, we were limited to this time frame by the course schedule. Additionally, the subjects had previous knowledge in similar languages from the course, so that it was possible to relate the newly introduced languages to that knowledge. Prior to the introduction lecture, we already handed out the experimental objects, so that the subjects knew which treatment they would receive and could concentrate on that language during the lecture. Additionally, they could familiarise themselves with the model. The subjects were encouraged not to share or exchange the objects with each other. After the introduction lecture, we handed out the remaining parts of the instrument, namely the questionnaires and the syntax

has correctly understood the model. If this one is already incorrect, we automatically awarded 0 points to the second sub-question as well. Additionally, the second sub-question was much harder to get right by chance.

B. Internal Validity

In order to avoid maturation or learning effects, subjects were only allowed to participate in the experiment once and only in one group, and were not allowed to exchange information with other subjects during the experiment. Additionally, we used a pre-experiment questionnaire in order to assess the subjects’ domain and modelling knowledge, which might affect the outcome. While all students came from the same course, they had different previous experience with respect to software modelling and requirements engineering. We also ensured that the subjects voluntarily participated in the experiment, by not giving rewards in the form of improved course grades or similar, in order to avoid compensation rivalry or demoralisation. However, we cannot entirely rule out that some subjects participated to win our appraisal later in the course. The fact that we used volunteers might bias the results, as they could have been more motivated than the average.

C. External Validity

We used parts of a real-life specification instead of a toy example for the experiment instrument. However, the requirements had to be abstracted as the original specification is confidential. Additionally, while modelling the requirements, we had to ensure that both treatments were modelled in the same way and exhibited the same behaviour. This could have led to one of the treatments being modelled in a way which would not happen in practice, and thus limit generalisability. We tried to reduce this threat by iteratively discussing and
help. Subjects then received 3 minutes for filling out the pre-     improving the instrument among the authors of this paper.
experiment questionnaire, 40 minutes to fill out the experiment     Additionally, the fact that we used student subjects possibly
questionnaire, and finally 2 minutes for the post-experiment        limits the generalisability to an industrial context. Finally,
questionnaire.                                                      the specification is based on an automotive requirements
                         V. VALIDITY                                specification, which can limit the generalisability to other
                                                                    domains.
  We will in the following discuss means which we took in
order to ensure validity. We use the four aspects of validity as    D. Conclusion Validity
presented in Wohlin et al. [20].                                     We tried to avoid ambiguous wording of questions in
                                                                  the questionnaire by iteratively reviewing and improving it.
A. Construct Validity
                                                                  Additionally, we performed a pilot experiment with two PhD
   In order to avoid inadequate preoperational explication of students prior to the actual experiment, in order to improve
constructs, we have explicitly defined what ’comprehensibility’ both the introduction material and the questionnaire. Reliability
means with respect to our study. Also, it is clearly defined that of treatment implementation is given, as the introduction lecture
a higher score in any of the three dependent variables means a was only given once for the actual experiment. We did only
better result for that variable. Our dependent variables do not perform statistical tests on the three dependent variables, which
require any human judgment and are therefore objective. Mono- were defined up-front, and did not fish for results [20].
operation bias can currently not entirely be ruled out, as we
only used one experimental object. We are planning to replicate                  VI. R ESULTS AND D ISCUSSION
the experiment with another requirements specification in the        In the following, we will discuss first the demography of
future in order to address this. Mono-method bias is addressed the subjects participating in the experiment. Afterwards, we
by asking two sub-questions for each of the 12 experiment present and discuss the results of the hypothesis testing for
questions. While the first of the two sub-questions is a simple the experiment. Finally, we finish with a discussion of the
yes/no question, it is an additional check whether the subject post-experiment questionnaire.


                                                                21
A. Demographic Data
   Out of the 22 subjects, 19 are Bachelor students and 3 are Master students. This can be explained through the fact that the course in which we performed the experiment is on Bachelor level, but can be taken as an elective course by first year Master students. All 3 Master students were randomly assigned the MSD treatment. Out of 22 subjects, 13 have a secondary school degree, 7 a Bachelor degree, 1 a Master degree, and 1 subject another degree as their highest degree. This means that 5 subjects on Bachelor level are already in possession of a Bachelor degree, and one Master student already has a Master degree. While this is certainly possible, it might also be caused by misunderstanding the question. Most subjects already had previous courses on related topics, such as Object-oriented programming or Software Architecture. Only 6 subjects stated that they had not taken any related courses previously. Additionally, we asked the subjects for their professional experience in developing software, in modelling software, and in requirements engineering. In both modelling software and in requirements engineering, only 3 subjects answered that they had previous professional experience, ranging from half a year to three years of experience. In addition to this, 9 subjects stated that they have professional experience in software development, with one subject each stating 0.3 years, 1 year, and 8 years of experience, and 3 subjects each stating 2 and 3 years of experience.

B. Experiment Results
   The experiment was conducted on 4th December 2014 at Chalmers University in Gothenburg, Sweden. The answers from the paper questionnaire were afterwards digitised in order to allow computerised data processing. An overview of both the descriptive statistics and the significance testing for all three variables is given in Tables I and II.

                               TABLE I
             DESCRIPTIVE STATISTICS OF THE EXPERIMENT

        Treatment    Dependent variable   Mean     Standard deviation
        TA           Answered             5.667    3.42
                     AScore               0.693    0.726
                     Score                5.583    7.669
        MSD          Answered             9.4      3.273
                     AScore               0.576    0.45
                     Score                6.2      5.453

                              TABLE II
             SIGNIFICANCE TESTING OF THE EXPERIMENT

        Dependent    Significance level   Significance level   H0 rejected
        variable     Levene               Mann-Whitney U
        Answered     p ≈ 0.94             p ≈ 0.021            Yes
        AScore       p ≈ 0.097            p ≈ 0.947            No
        Score        p ≈ 0.707            p ≈ 0.464            No

   The results of the first dependent variable, Answered, are depicted in Figure 4 for both treatments. Clearly, subjects in the TA group took longer to answer the questionnaire, which led to only one subject finishing all questions. In the MSD group, half of the subjects finished all questions. Additionally, four subjects in the TA group answered three or fewer questions, whereas this is only the case for one subject in the MSD group. The large difference in the two means for this variable already suggests that the null hypothesis can be rejected, which is confirmed by the significance test with p ≈ 0.021. Hence, there is a significant difference with respect to the number of answered questions between the two treatments. A possible explanation for this might be the nature of MSDs, compared to TAs. While a single MSD has to be taken into account only once it is activated, each automaton in a TA is ’active’ by definition. This means that for each message in a given scenario, all automata need to be studied, while only a subset of the MSDs needs to be considered.

[Figure 4: bar charts of Answered per subject. Panels: Answered (TA), subjects 1 to 12; Answered (MSD), subjects 1 to 10. x-axis: Subject; y-axis: 0 to 12.]
                 Fig. 4. Answered of TA and MSD treatment

   The second dependent variable, AScore, is depicted in Figure 5 for both the TA and the MSD treatment. Here, in the TA treatment there is a much larger variance in the data set, with both very high and very low values. For the MSD treatment, there are few values in the extremes. The statistical test results in p ≈ 0.947, so that the null hypothesis cannot be rejected for this variable. We do not have an explanation for the large differences between subjects in the TA treatment, but they might be attributed to misunderstandings with respect to the modelling language. Several subjects achieved average scores under 1 point, even though they stated in the post-experiment questionnaire that they were confident in their answers. We plan to replicate the experiment in the future, including some simple upfront questions in order to measure whether the subjects have really understood the languages well enough, and to analyse whether this correlates with the self-assessment.
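The significance test reported for Score in Table II can be sketched as follows. The per-subject values are read off the bar labels of Figure 6, so their assignment to subjects is unknown, which does not matter for a rank test. The paper does not state which statistics tool or test variant was used; the normal approximation with tie-corrected variance and a 0.5 continuity correction is an assumption, and the names `average_ranks` and `mann_whitney_u` are illustrative.

```python
import math
from collections import Counter
from statistics import mean, variance

# Score values read off the bar labels of Figure 6 (subject order unknown).
SCORE_TA = [21, 20, 9, 9, 3, 2, 2, 1, 0, 0, 0, 0]
SCORE_MSD = [16, 14, 9, 7, 6, 5, 2, 2, 1, 0]


def average_ranks(values):
    """Map each distinct value to its average (mid) rank in the pooled sample."""
    ordered = sorted(values)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        # 0-based positions i..j-1 hold ranks i+1..j; tied values share the mean.
        ranks[ordered[i]] = (i + 1 + j) / 2
        i = j
    return ranks


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test via the normal approximation.

    Assumption: tie-corrected variance and a 0.5 continuity correction.
    Returns (U of x, U of y, two-sided p-value).
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = list(x) + list(y)
    ranks = average_ranks(pooled)
    u_x = sum(ranks[v] for v in x) - n1 * (n1 + 1) / 2
    u_y = n1 * n2 - u_x
    tie_term = sum(t**3 - t for t in Counter(pooled).values())
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = max(abs(u_x - n1 * n2 / 2) - 0.5, 0.0) / math.sqrt(var_u)
    p = math.erfc(z / math.sqrt(2))  # equals 2 * (1 - Phi(z))
    return u_x, u_y, p


u_ta, u_msd, p_value = mann_whitney_u(SCORE_TA, SCORE_MSD)
print(f"mean TA = {mean(SCORE_TA):.3f}, mean MSD = {mean(SCORE_MSD):.3f}")
print(f"U(TA) = {u_ta}, U(MSD) = {u_msd}, p = {p_value:.3f}")
```

Under these assumptions the sketch gives means that agree with Table I and a two-sided p of roughly 0.46, in line with the value reported for Score in Table II. The same function could be applied to Answered and AScore if the per-subject values were available.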
[Figure 5: bar charts of AScore per subject. Panels: AScore (TA), subjects 1 to 12, μ = 0.693, σ² = 0.526; AScore (MSD), subjects 1 to 10, μ = 0.576, σ² = 0.202. x-axis: Subject; y-axis: 0 to 2.]
                 Fig. 5. AScore of TA and MSD treatments

   As the third variable Score is directly computed from AScore and Answered, it exhibits a similar pattern (see Figure 6). In the TA group, two subjects achieved 20 or more points, close to the maximum of 24. However, many subjects in this group have low total scores. As subjects in the MSD group have significantly higher values in the Answered metric, their average Score values are higher, even though the average score AScore is lower for this group.

[Figure 6: bar charts of Score per subject. Panels: Score (TA), subjects 1 to 12, μ = 5.583, σ² = 58.81, bar labels in descending order 21, 20, 9, 9, 3, 2, 2, 1, 0, 0, 0, 0; Score (MSD), subjects 1 to 10, μ = 6.2, σ² = 29.73, bar labels in descending order 16, 14, 9, 7, 6, 5, 2, 2, 1, 0. x-axis: Subject; y-axis: 0 to 24.]
                 Fig. 6. Score of TA and MSD treatment

   In summary, subjects in the MSD group answered significantly more questions in the available time, which suggests that MSDs are quicker to comprehend. Therefore, if speed is a relevant factor, MSDs should be chosen instead of TAs. One could argue that speed itself is not relevant, as long as AScore is low. Therefore, we plan to replicate the experiment with subjects who are more familiar with the modelling languages, in order to see whether the difference in speed is still present.

C. Correlation between Demographic Data and Dependent Variables
   We used the Pearson product-moment correlation coefficient to assess the correlations between the three dependent variables and the number of related courses previously taken by students, the education level (Bachelor/Master), and the subject’s confidence in their answers. The resulting values for Pearson’s r and the p-value are depicted in Table III. Assuming an effect size of r < 0.3 as small, an effect size of 0.3 ≤ r < 0.4 as medium, and an effect size of r ≥ 0.4 as large, we see that there is a large correlation between all three dependent variables and the subject’s confidence in their results. This result indicates that subjects had, on average, a clear grasp of whether they understood the instrument or not. Interestingly, both the number of previous courses and the education level only show a small correlation with the dependent variables. Similarly, previous experience in Software Development, Software Modelling, and Requirements Engineering has a small correlation with the dependent variables, as depicted in Table IV. These results could indicate that the dependent variables were in fact influenced by other factors, such as confusion regarding the newly introduced modelling languages. However, they could also indicate that the understanding of the requirements is not dependent on previous education and experience. Further replications will be necessary in order to answer these questions in a satisfactory manner.

                              TABLE III
      CORRELATIONS BETWEEN DEPENDENT VARIABLES AND DEMOGRAPHICS

        Dependent    Previous Courses     Bachelor/Master      Confidence
        variable     r        p           r         p          r        p
        Answered     0.207    0.355       −0.247    0.267      0.56     0.007
        AScore       0.122    0.588       0.116     0.606      0.513    0.015
        Score        0.17     0.45        0.074     0.745      0.557    0.007

                              TABLE IV
       CORRELATIONS BETWEEN DEPENDENT VARIABLES AND EXPERIENCE

        Dependent    Software Dev.        Modelling            Req. Eng.
        variable     r        p           r         p          r        p
        Answered     0.08     0.724       0.054     0.811      0.092    0.685
        AScore       −0.044   0.846       0.053     0.816      0.063    0.781
        Score        −0.09    0.692       0.028     0.901      0.039    0.863
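The correlation analysis in Section VI-C can be illustrated with a minimal sketch. It shows a plain Pearson product-moment coefficient and the effect-size thresholds stated in the text, applied to two columns of r values from Table III. The function names and the dictionary layout are illustrative. Applying the thresholds to the magnitude |r| is an assumption, since the text states them for r only, and the per-subject data behind the tables is not available here.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient of two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (spread_x * spread_y)


def effect_size(r):
    """Classify r using the thresholds from the text (small/medium/large).

    Assumption: the thresholds are applied to |r|, so that negative
    correlations are classified by their magnitude.
    """
    magnitude = abs(r)
    if magnitude < 0.3:
        return "small"
    if magnitude < 0.4:
        return "medium"
    return "large"


# r values copied from the Confidence and Bachelor/Master columns of Table III.
TABLE_III = {
    "Answered": {"confidence": 0.56, "bachelor_master": -0.247},
    "AScore": {"confidence": 0.513, "bachelor_master": 0.116},
    "Score": {"confidence": 0.557, "bachelor_master": 0.074},
}

for variable, columns in TABLE_III.items():
    labels = {name: effect_size(r) for name, r in columns.items()}
    print(variable, labels)
```

With these thresholds, all three Confidence correlations fall into the large class and the Bachelor/Master correlations into the small class, matching the reading given in the text.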
          VII. CONCLUSIONS AND FUTURE WORK                                                      REFERENCES
                                                                       [1] M. C. Otero and J. J. Dolado, “Evaluation of the comprehension of
   In this paper, we have presented the results of a controlled            the dynamic modeling in UML,” Information and Software Technology,
experiment with 22 students in an undergraduate course on                  vol. 46, no. 1, pp. 35–53, 2004.
software modelling. We studied the comprehensibility of                [2] C. Glezer, M. Last, E. Nachmany, and P. Shoval, “Quality and com-
                                                                           prehension of uml interaction diagrams-an experimental comparison,”
functional requirements modelled in two graphical languages,               Information and Software Technology, vol. 47, no. 10, pp. 675–692,
Modal Sequence Diagrams, a sequence-based notation, and                    2005.
Timed Automata, a state-based notation. Subjects received a            [3] A. Nugroho, “Level of detail in uml models and its impact on model
                                                                           comprehension: A controlled experiment,” Information and Software
model in one of the two languages and a questionnaire with                 Technology, vol. 51, no. 12, pp. 1670–1685, 2009.
questions testing their understanding of the model. While we           [4] M. Staron, L. Kuzniarz, and C. Wohlin, “Empirical assessment of
cannot reject the null hypothesis that there are no significant          using stereotypes to improve comprehension of uml models: A set
                                                                           of experiments,” Journal of Systems and Software, vol. 79, no. 5, pp.
differences between the two treatments, for both the average               727–742, 2006.
and the total questionnaire scores, subjects receiving the Modal       [5] N. Condori-Fernández, M. Daneva, K. Sikkel, R. Wieringa, O. Dieste,
Sequence Diagram specification answered significantly more                 and O. Pastor, “A systematic mapping study on empirical evaluation of
                                                                           software requirements specifications techniques,” in Proceedings of the
questions. This indicates that if the speed or the efficiency plays        2009 3rd International Symposium on Empirical Software Engineering
an important role, scenario-based models should be considered              and Measurement. IEEE Computer Society, 2009, pp. 502–505.
instead of the state-based models. However, further studies            [6] N. Condori-Fernández, M. Daneva, K. Sikkel, and A. Herrmann,
                                                                           “Practical relevance of experiments in comprehensibility of requirements
need to be conducted in order to understand whether this                   specifications,” in Empirical Requirements Engineering (EmpiRE), 2011
effect persists with more experienced users who achieve higher             First International Workshop on, Aug 2011, pp. 21–28.
overall scores.                                                        [7] S. Abrahão, C. Gravino, E. Insfran, G. Scanniello, and G. Tortora,
                                                                           “Assessing the effectiveness of sequence diagrams in the comprehension
   While our sample of students without a previous knowledge               of functional requirements: Results from a family of five experiments,”
of the used treatments can be seen as a possible threat to                 Software Engineering, IEEE Transactions on, vol. 39, no. 3, pp. 327–342,
                                                                           March 2013.
validity, this lack of experience is in fact a realistic setup         [8] D. Harel and S. Maoz, “Assert and negate revisited: Modal semantics
for industrial use in the automotive domain. As requirements               for UML sequence diagrams,” Software and Systems Modeling (SoSyM),
specifications are used across organisations and across roles              vol. 7, no. 2, pp. 237–252, May 2008.
                                                                       [9] R. Alur and D. L. Dill, “A Theory of Timed Automata,” Theoretical
within an organisation, it can not be assumed that the receiver            Computer Science, vol. 126, no. 2, pp. 183–235, 1994.
of a specification is always familiar with every detail of the used   [10] K. G. Larsen, M. Mikucionis, B. Nielsen, and A. Skou, “Testing real-
language. Additionally, receivers are often no experts in mod-             time embedded software using uppaal-tron: An industrial case study,”
                                                                           in Proceedings of the 5th ACM International Conference on Embedded
elling, but in other areas such as requirements engineering or             Software. ACM, 2005.
system design. Therefore, in contrast to, for example, software       [11] A. Fehnker, “Scheduling a steel plant with timed automata,” in rtcsa.
development, the receivers of a requirements specification can             IEEE, 1999, p. 280.
                                                                      [12] J. Greenyer, M. Haase, J. Marhenke, and R. Bellmer, “Evaluating a
not be expected to be experts in the used language. Additionally,          formal scenario-based method for the requirements analysis in automotive
our results indicate that the current practice, choosing the               software engineering,” in Proceedings of the 2015 10th Joint Meeting
modelling language based on convenience, is not a threat to                on Foundations of Software Engineering. ACM, 2015.
                                                                      [13] O. M. Group, “Unified modeling language,” http://www.uml.org/, Jun.
the comprehension of the specifications in itself.                         2014.
   In the future, we will replicate the experiment both with          [14] E. Kamsties, A. von Knethen, and R. Reussner, “A controlled experiment
different groups of students and with professionals from our               to evaluate how styles affect the understandability of requirements
                                                                           specifications,” Information and Software Technology, vol. 45, no. 14,
industrial partners in order to eliminate possible bias and to             pp. 955–965, 2003, eighth International Workshop on Requirements
assess whether experience and a deeper knowledge of the                    Engineering: Foundation for Software Quality.
languages can have a significant impact on the understanding.         [15] G. Scanniello, M. Staron, H. Burden, and R. Heldal, “On the effect of
                                                                           using SysML requirement diagrams to comprehend requirements: Results
Additionally, we will aim at generating a theory on which                  from two controlled experiments,” in 18th International Conference on
languages are suitable for which kind of task or system when               Evaluation Assessment in Software Engineering (EASE), May 2014, pp.
modelling requirements.                                                    433–442.
                                                                      [16] W. Damm and D. Harel, “LSCs: Breathing life into message sequence
                                                                           charts,” in Formal Methods in System Design, vol. 19. Kluwer Academic,
                     ACKNOWLEDGEMENT                                       2001, pp. 45–80.
                                                                      [17] D. Harel and R. Marelly, Come, Let’s Play: Scenario-Based Programming
                                                                           Using LSCs and the Play-Engine. Springer, August 2003.
   We would like to express our gratitude to Nadja Marko              [18] J. Bengtsson and W. Yi, “Timed automata: Semantics, algorithms and
and Christian Webel, who helped in reviewing and discussing                tools,” in Lectures on Concurrency and Petri Nets, vol. 3098. Springer,
an early experiment design. Additionally, we would like to                 2003, pp. 87–124.
                                                                      [19] V. R. Basili, “Software modeling and measurement: The
thank Pariya Kashfi and Vard Antinyan for participating in                 goal/question/metric paradigm,” Tech. Rep., 1992.
the pilot experiment. The research leading to these results has       [20] C. Wohlin, P. Runeson, M. Höst, M. C. Ohlsson, and B. Regnell,
received partial funding from the European Union’s Seventh                 Experimentation in Software Engineering. Springer, 2012.
                                                                      [21] W. Damm and D. Harel, “LSCs: Breathing life into message sequence
Framework Program (FP7/2007-2013) for CRYSTAL-Critical                     charts,” in Formal Methods in System Design, vol. 19. Kluwer Academic,
System Engineering Acceleration Joint Undertaking under grant              2001, pp. 45–80.
agreement No 332830 and from Vinnova under DIARIENR
2012-04304.

