=Paper=
{{Paper
|id=Vol-392/paper-4
|storemode=property
|title=Assessing the Power of A Visual Notation - Preliminary Contemplations on Designing a Test
|pdfUrl=https://ceur-ws.org/Vol-392/Paper4.pdf
|volume=Vol-392
}}
==Assessing the Power of A Visual Notation - Preliminary Contemplations on Designing a Test==
MODELS`08 Workshop ESMDE
Assessing the Power of A Visual Notation
- Preliminary Contemplations on Designing a Test -
Dominik Stein and Stefan Hanenberg
Universität Duisburg-Essen
{ dominik.stein, stefan.hanenberg }@icb.uni-due.de
Abstract. This paper reports on preliminary considerations made in designing an
empirical experiment to assess the comprehensibility of a visual notation in
comparison to a textual notation. The paper briefly sketches how a
corresponding hypothesis could be developed. Furthermore, it presents several
recommendations that aim at reducing confounding effects. It is believed that
these recommendations are applicable to other experiments in the domain of
MDE, too. Finally, the paper reports on initial experiences gained while
formulating test questions.
1 Introduction
Although modeling does not imply visualization, people often consider the visual
representation of models to be a key characteristic of modeling. One reason for this
could be that modeling techniques such as State Machines or Petri nets are often
taught and explained with the help of circles and arrows rather than in terms of
mathematical sets and functions. Apart from that, other kinds of modeling, e.g. data
modeling with the help of Entity-Relationship diagrams, make heavy use of visual
representations, although the same concepts could be specified in a purely textual
manner, too.
However, setting aside the impression that visual representations are considered
very appealing by a broad range of developers, customers, maintainers, students, etc.,
the scientific question is whether visual representations actually yield any extra
benefit to software development, maintenance, teaching, etc.
Driven by the authors' personal belief that this extra benefit exists, this paper
reports on preliminary considerations made in designing an empirical experiment.
The goal of this experiment is to assess (such a "soft" property as) the
"comprehensibility" of a visual notation in comparison to a textual notation.
This paper does not formulate a concrete hypothesis. Instead, it offers general
contemplations on hypotheses that are concerned with the evaluation of
"comprehensibility". In particular, the paper presents several recommendations that
aim at reducing confounding effects while running the test. It is suggested that
these recommendations be obeyed in other experiments in the domain of
MDE, too. Furthermore, the paper reports on experiences gained while
formulating the test questions for a test on comprehensibility.
The paper is structured as follows: In section 2, the process to define a hypothesis
is outlined. In sections 3 and 4, a couple of considerations are presented in order to
reduce confounding effects. In section 5, problems are presented which have been
encountered while formulating test questions. Section 6 presents some related work.
And section 7 concludes the paper.
2 Defining the Goal of the Experiment, and What to Measure?
When designing a controlled experiment, everything is subordinate to the overall
assumption, or hypothesis, that is going to be tested. Usually, developing the
test requires repeatedly reformulating (refining) the hypothesis because the initial
hypothesis turns out not to be (as easily) testable (as presumed). (A possible reason
for this could be, for example, that it is overly hard, or impossible, to reduce the
impact of confounding effects or to find suitable questions; cf. sections 3, 4 and 5.)
2.1 Experiment Definition
A first step could be to define the experiment in general. When comparing visual vs.
textual notations, this could be done as follows (using the experiment definition
template suggested by [10]):
The goal of the study is to analyze visual and textual program specifications (i.e.
diagrams versus code), with the purpose of evaluating their effect on the
"comprehensibility" of the information shown. The quality focus is the speed,
completeness, and correctness with which all relevant information is
apprehended. The perspective is that of teachers and program managers, who would
like to know the benefit that visual notations can bring to their work (i.e. teaching
students in computer science or developing software). The context of the experiment
is made up of artificial/sample code snippets and their corresponding diagrams
(= objects) as well as undergraduate and graduate students (= subjects).
2.2 Hypothesis Formulation
According to [4], a scientific hypothesis meets the following three criteria:
• A hypothesis must be a "for-all" (or rather a "for-all-meeting-certain-
criteria") statement. This means in particular that the hypothesis must hold
for more than a singular entity or situation.
• A hypothesis must be (able to be reformulated as) a conditional clause (of
the form "whenever A is true/false, B is (also) true/false").
• A hypothesis must be falsifiable. That means that, in principle, it must be
possible to find an entity or situation in which the hypothesis is not true.
Furthermore, for practical reasons, [1, 5] suggest basing the hypothesis on
observable data. That is, in the (possibly reformulated) conditional clause, the value
of one observable variable (called "the dependent variable") must be specified to
depend on the value of another observable variable (called "the independent
variable") in a consistent way. The hypothesis is falsified if at least one example can
be found in which this dependency is not satisfied.
A starting point to find a hypothesis for the experiment outlined in section 2.1
could be the following:
When investigating program specifications, a visual representation X (as
compared to a textual representation Y) significantly facilitates comprehension of
information Z.
Following the aforementioned criteria, the above hypothesis is a scientific
hypothesis because it can be rephrased as "whenever a program specification is
represented using a visual notation X, it is easier to comprehend (with respect to
information Z) than an equivalent representation using a textual notation Y". In this
statement, the possible values (treatments) of the independent variable (factor) are
"visual/not visual" and the possible values of the dependent variable are "easier to
comprehend/not easier to comprehend". The claimed dependency would be "visual →
easier to comprehend". The statement could be falsified by showing that visual
notation X is not easier to comprehend than textual notation Y (with respect to
information Z).
Fig. 1. Confounding impacts. (The figure depicts the factors that confound the
comparison of a visual and a textual notation: the semantic equality of the objects,
the semantic compression of the objects, the syntactic representation of the objects,
the background knowledge and skills of the subjects, and the familiarity of the
subjects with the notations to test.)
2.3 Variable Selection
Turning the preliminary idea of a hypothesis into a testable hypothesis which is
thoroughly rooted in objectively observable data is a challenging task in developing
an empirical test. For example, since comprehensibility by itself is difficult to
observe, another variable must be found whose values are considered to inherently
depend on the level of comprehension of a tester. A commonly accepted variable
measuring the level of comprehension, for example, is "correctness", i.e. the number
of correct answers1 given to the (test) questions (cf. [8, 7, 6]). However, as pointed out
by [7], correctness is only one facet of comprehensibility. Another variable is
"comprehension speed", e.g. the number of seconds that the subjects looked at the
object (or maybe even "ease of remembering", i.e. the number of times that the subjects
looked at the objects; cf. [7]). The inherent effect of the variable that is of interest on
the variable that is measured must be substantially elucidated (and defended) in the
discussion on the (construct) validity of the test.
1 If the correct answer consists of multiple elements, it could be some mean of precision and
recall [2] (cf. [6]).
The only factor (independent variable) in the experiment would be "kind of
presentation" with the treatments (values) {visual, textual}.
One of the big challenges when investigating the causal dependencies between the
(dependent and independent) variables is to reduce confounding impacts (see Fig. 1)
as much as possible, and thus to maximize the validity of the experiment (cf. [10]).
Otherwise, the "true" dependency could be neutralized (at least in parts), or
might even be turned into its reciprocal direction (in the worst case).
In the following sections, some measures are presented which should be taken in
order to improve the validity of an experiment comparing a visual and a textual
notation. The authors believe that these measures are general enough to be applied to
other evaluating experiments in the domain of MDE, too.
3 Preparing Objects – Ensuring Construct Validity (I)
Construct validity refers "to the extent to which the experiment setting actually
reflects the construct under study" [10]. In particular, this means to ensure that the
objects of the experiment which are given to the subjects in order to perform the tests
represent the cause well (i.e. a visual vs. a textual representation, in this case).
3.1 Semantic Equality
One obvious requirement, which must nevertheless be ensured carefully, is to
compare only (visual and textual) representations that have equal semantics. It would
be illegal and meaningless to compare any two representations with different
semantics.
Fig. 2. Ensure semantic equality. (Left: a UML class diagram in which classes A and
B are connected by a single bidirectional association with ends a and b. Middle: the
Java code

    class A {
        B b;
    }
    class B {
        A a;
    }

Right: a UML class diagram in which classes A and B are connected by two separate
unidirectional associations a and b.)
Fig. 2 shows an example. It would be illegal to compare the visual representation
on the left with the textual representation in the middle since they mean different
things. The bidirectional association between classes A and B in the UML model in
the left of Fig. 2 denotes that two instances of class A and B are related to each other
such that the instance of class A can navigate to the instance of class B via property b,
while at the same time the instance of class B can navigate to the instance of class A
via property a (meaning a = a.b.a is always true). The Java program code in the
middle of Fig. 2, however, does not imply that an instance of class A which is
associated with an instance of class B (via its property b) is the same instance which
that associated instance of class B can navigate to via its property a (meaning a =
a.b.a does not need to be true).
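The Java semantics just described can be demonstrated by a small, runnable sketch (the class names follow Fig. 2; the concrete test scenario is our own illustration, not part of the paper):

```java
// Plain fields, as in the middle of Fig. 2, do not enforce that the two
// references are mutually consistent: a == a.b.a need not hold.
class A { B b; }
class B { A a; }

public class SemanticEquality {
    public static void main(String[] args) {
        A a = new A();
        B b = new B();
        a.b = b;
        b.a = new A();  // a *different* A instance; the fields alone permit this
        System.out.println(a == a.b.a);  // prints "false": the invariant is not enforced
    }
}
```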
Hence, in an empirical test comparing the performance2 of visual vs. textual
representations of associations, it would be more appropriate (in fact, obligatory) to
compare the textual representation in the middle of Fig. 2 with the visual
representation in the right of Fig. 2. Now, the semantic meaning of one notation is
equally represented in the other notation, and comparing the results of their individual
performance is valid3.
3.2 Equal Degree of Compression
Apart from semantic equality, the expressions being compared need to be expressed at
an equal degree of compression (here, the degree of compression shall refer to the
degree to which semantic information is condensed into one language construct).
Otherwise, "better" performance of one notation could be induced by the fact that this
notation uses a "higher" compression (e.g. one language construct of that notation
conveys the same semantic information as four language constructs of the other
notation) rather than by the fact that it uses a "better" representation.
Fig. 3. Do not test expressions of unequal degree of compression. (Left: the UML
class diagram of Fig. 2 with the bidirectional association a/b between classes A and
B. Right: the Java code

    class A {
        B b;
        B getB() { return b; }
        void setB(B b) { this.b = b; b.a = this; }
    }
    class B {
        A a;
        A getA() { return a; }
        void setA(A a) { this.a = a; a.b = this; }
    }
)
Fig. 3 gives an example. Other than in Fig. 2, the Java code now contains extra
lines which state that an instance of class A which is associated with an instance of
class B (via its property b) must be the same instance to which that associated instance
of class B can navigate via its property a (meaning a = a.b.a is always true).
Hence, the Java expression in the right of Fig. 3 now equates to the semantics of the
UML expression in the left of Fig. 3.
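That the mutually updating setters of Fig. 3 establish the invariant can again be checked with a small runnable sketch (the test scenario is our own illustration):

```java
// The setters from Fig. 3 update both ends of the association, so a == a.b.a
// holds after linking (getters omitted; they are not needed for the check).
class A {
    B b;
    void setB(B b) { this.b = b; b.a = this; }
}
class B {
    A a;
    void setA(A a) { this.a = a; a.b = this; }
}

public class DegreeOfCompression {
    public static void main(String[] args) {
        A a = new A();
        a.setB(new B());                 // links both directions at once
        System.out.println(a == a.b.a);  // prints "true"
    }
}
```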
2 In this paper, "performance" refers to "the notation's ability to be read and understood" rather
than to computation speed.
3 Note that asserting the semantic equality of two notations is not trivial. For example, there is
no general agreement on how a UML class diagram should be transformed into Java code.
If – in a test – the UML expression should actually yield "better" results than the
Java expression now, it is unclear (and highly disputable) whether the "better"
performance is due to the visual representation or due to the higher degree of
compression (i.e. the fact that we need to read and understand four method definitions
in the Java code as compared to just one association in the UML diagram).
3.3 Presenting Objects
Apart from equal semantics and equal degree of compression, the expressions have to
be appropriately formatted, each to its cleanest and clearest extent. This is because the
authors expect that disadvantageous formatting of expressions could have a negative
impact on the test outcome, whereas advantageous formatting of expressions could
improve the test results.
Fig. 4 gives an example. In the left part of Fig. 4, the Java code has been formatted
in a way which is tedious to read. In the right part of Fig. 4, the UML representation
has been formatted disadvantageously. With expressions formatted like this, the
respective notation can be assumed to be condemned to fail in the performance test.
Fig. 4. Format expressions to their cleanest and clearest extent. (Left: the Java code
of Fig. 3 reformatted with arbitrary, erratic line breaks, which makes it tedious to
read, contrasted with the UML class diagram. Right: a disadvantageously formatted,
overlapping rendering of the UML class diagram, contrasted with the cleanly
formatted Java code.)
Unfortunately, there usually is no (known) optimal solution to the formatting task.
Therefore, expressions should be formatted clearly and consistently, following strict
and predefined guidelines (e.g. formatting guidelines such as [9]). It is
important to keep in mind, though, that even if uniform guidelines are used to
format the expressions, the effects of those formatting guidelines on the test outcomes
are unclear. Moreover, the effects may even differ between the notations.
Consequently, the (unknown) impact of formatting guidelines on the test results needs
to be respected in the discussion of the (construct) validity of the test.
Likewise, syntactic sugar is to be avoided. That is, all means that are not
related to the semantics of the underlying notation, such as syntax highlighting in
textual expressions, or different text formats and different line widths in visual
expressions, should not be used. Syntactic sugar (fonts, line widths, colors, etc.) is
likely to draw the testers' attention to different parts of the expressions and thus
may confound the pure comparison between the visual and the textual representation.
Evaluating the impacts of formatting, fonts, line width, and colors on the
comprehensibility of a notation is an interesting test of its own. However, that test
should focus on the comparison of different style guidelines for one notation rather
than on the comparison of (different) guidelines for different notations.
4 Preparing Subjects – Ensuring Internal Validity
To ensure internal validity, it must be ensured that a relationship between a treatment
and an outcome results from a causal relationship between those two, rather than from
a factor which has not been controlled or has not been measured (cf. [10]). In
particular, this concerns how to "treat", select, and distribute the subjects such that no
coincidental imbalance exists between one group of testers and another.
4.1 Semantic Familiarity
The imperative necessity of comparing semantically equivalent "expressions" (see
section 3.1) is complemented by the necessity that the testers are equally trained in,
and familiar with, both notations. Otherwise, i.e. if the testers of one notation are
more experienced with their notation than the testers of the other notation with theirs,
a "better" test result of the former notation could be induced by the fact that
its testers have greater experience in using/reading it rather than by the fact that it is
actually "better" (in whatsoever way). This is particularly probable whenever the
performance of new notations is to be evaluated against that of existing ones.
One way to control the knowledge of the tested notations is to look for testers who
are familiar with neither notation, and to have them take a course in which they learn
the notations to test. This approach seems particularly practicable in academia – even
though the test results will then usually assess the performance of "beginners", and thus
make extrapolation to the performance of "advanced" software developers in
industrial settings difficult (which does not mean that assessing the benefits of visual
notations for "beginners" isn't worthwhile and interesting). This problem represents a
threat to the external validity of the experiment (cf. [10]).
The goal of teaching the notations to novices is to ensure that the testers of each
notation attain similar knowledge of, and skill with, their notation. The challenge here
is to define what it means for testers to be "equally familiar" (i.e. equally
knowledgeable and skilled) with their notations. It also needs to be investigated how
the knowledge and skills of an individual tester with his/her notation can actually be
assessed (so that we can decide afterwards whether or not "equal familiarity" has been
reached). Another challenge is how "equal familiarity" can be achieved by a teaching
course in a timely and didactically appropriate manner (e.g., what is to be done if a
particular group of testers encounters unforeseen comprehension problems with their
notation?).
The knowledge and skill test could occur prior to the actual performance test, or
intermingled with the performance test (in the latter case, some questions test the
knowledge and the skills of the testers, while other questions test the performance of
the notations). If the knowledge and skill test reveals that the semantic familiarity of
the testers with their notation is extremely unbalanced (between the groups of testers),
the test outcome must be considered meaningless.
5 Measuring Outcomes – Ensuring Construct Validity (II)
Once the hypothesis is sufficiently clear, the next challenging step is to formulate
questions that are suitable to test the hypothesis and to find a test format that is
suitable to poll the required data. This is another facet of construct validity, according
to which the outcome of the test needs to represent the effects well (cf. [10]).
In this section, considerations and experiences are presented that have been made
in designing a test evaluating the comprehensibility of a visual notation.
5.1 Test Format, and How to Measure?
Multiple-choice tests (when carefully designed; cf. [3]) are considered a good
and reliable way to test the knowledge of a person, particularly in comparison to
simple true/false tests. Hence, multiple-choice tests would have a higher construct
validity with respect to the correctness of comprehension than true/false tests. A
question format with free-answer capabilities would be more realistic (and thus would
increase the external validity of the experiment; cf. [10]). However, such a short-answer
test is much more laborious because it requires manual post-processing in order to
detect typos and/or semantically equivalent answers.
When it comes to measuring the response time, it is important to discriminate
between the time to find the answer in the expression and the time to understand the
question. This is because if testers need 30 sec. to understand a question and then 10
sec. to find the answer in the textual expression, but just 5 sec. to find the answer in
the visual expression, it makes a difference whether 40 sec. are compared to 35 sec.,
or 10 sec. to 5 sec. Not discriminating between the time to find an answer and the
time to understand a question is only valid if the ratio is reversed, i.e. if the time to
understand a question is negligibly short in comparison to the time to find the answer.
If the test outcome consists of more than one measure, it is a big challenge to define
how the measures can be combined in order to obtain a meaningful interpretation. In
this case, for example, it needs to be decided how "correctness of answers" and
"response time" can be combined to indicate a "level of comprehension". One option
would be to disregard all incorrect answers and to consider the response time of the
correct answers only.
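The last option can be sketched as follows (the Answer type, its field names, and the sample data are hypothetical, introduced only for illustration):

```java
import java.util.List;

// Sketch of the combination rule mentioned above: disregard all incorrect
// answers and average the response time of the correct answers only.
public class ComprehensionScore {
    record Answer(boolean correct, double responseTimeSec) {}

    static double meanTimeOfCorrectAnswers(List<Answer> answers) {
        return answers.stream()
                .filter(Answer::correct)              // incorrect answers are disregarded
                .mapToDouble(Answer::responseTimeSec)
                .average()
                .orElse(Double.NaN);                  // no correct answer was given
    }

    public static void main(String[] args) {
        List<Answer> answers = List.of(
                new Answer(true, 5.0),
                new Answer(false, 30.0),  // ignored
                new Answer(true, 10.0));
        System.out.println(meanTimeOfCorrectAnswers(answers));  // prints "7.5"
    }
}
```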
5.2 Volatile (Time) Measurements – Problems of A First Test Run
Preliminary and repeated runs of a test evaluating a simple analysis of one textual
expression4 (with the same person) have shown that the measured time needed to
answer a question (exclusive of the time needed to understand the question; cf.
section 5.1) is rather short (~10 sec. on average) and varies tremendously (3 sec. to
30+ sec., even for the same or similar questions!). It seems as if the measured time is
heavily confounded by some external factor (perhaps slight losses of concentration).
This is problematic because, due to the short (average) response time, even the
slightest disturbance (of about 1 sec.) can confound the measured (average) time
significantly (by about one tenth, in this case).
4 In a case other than association relationships.
Another problem was to strictly discriminate between the time to find the answer
in the expression and the time to understand the question (which, again, was essential
due to the short (average) response time). The testers were required to explicitly flip
to the expression once they had carefully read (and understood) the question (which
was shown first). As it turned out, however, testers sometimes realized that they had
not fully understood the question only after they had already flipped to the expression.
As a result, the measured response time was partly confounded.
It is currently being investigated how the problem of high variation in the
measurements can be tackled. One option would be to pose questions that are more
difficult to answer, and thus take more time. This will only work, though, if the
confounding effects do not grow proportionally. Another option would be to repeat
the test many times (with the same person and similar questions) in order to obtain a
more reliable average response time. A big problem of this approach is to ensure that
the testers do not benefit from learning effects in the repeated tests.
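One further conceivable countermeasure, offered here purely as an assumption of ours and not proposed in the paper, is to summarize repeated measurements with a robust statistic such as the median, which is less sensitive than the arithmetic mean to the occasional outlier caused by, e.g., a slight loss of concentration:

```java
import java.util.Arrays;

// Sketch: the median of repeated response-time measurements is robust against
// occasional outliers, unlike the arithmetic mean. Using the median here is an
// illustrative assumption, not a recommendation made in the paper.
public class RobustTiming {
    static double median(double[] timesSec) {
        double[] sorted = timesSec.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2]
                          : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    public static void main(String[] args) {
        // 3 sec. to 30+ sec. for similar questions, as observed in the first test run
        double[] measured = {3.0, 9.0, 10.0, 11.0, 31.0};
        System.out.println(median(measured));  // prints "10.0"; the mean would be 12.8
    }
}
```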
A promising solution to properly discriminate between the time to find the answer
in the expression and the time to understand the question has been found in [7].
6 Related Work
In 1977, Shneiderman et al. [8] conducted a small empirical experiment that
tested the capabilities of flow charts with respect to comprehensibility, error
detection, and modification in comparison to pseudo-code. Their outcome was that –
statistically – the benefits of flow charts were not significant. Shneiderman et al. did
not measure time, though.
Measuring time, however, was deemed indispensable by Scanlan [7]. Scanlan
formulated five hypotheses (e.g. "structured flow charts are faster to comprehend"
and "structured flow charts reduce misconceptions", to name just the two most
closely related to this paper). Scanlan's test design is very interesting: Scanlan
separated the comprehension (and response) time of the question from the
comprehension time of the expression. To do so, testers could either look at the
question or look at the expression (an algorithm, in this case). This is an interesting
solution to the aforementioned problem of separating comprehension time from
response time (see section 5.1). Scanlan's outcome was that structured flow charts are
beneficial.
7 Conclusion
This paper has presented preliminary considerations made in designing an
empirical experiment to assess the comprehensibility of visual notations
in comparison to textual notations. The paper has briefly discussed how a
corresponding hypothesis could be developed. Furthermore, it has presented several
recommendations that aim at reducing disturbances in the measured data,
which are considered to be helpful for other experiments in the domain of MDE, too.
Finally, the paper has reported on initial experiences gained while formulating the
test questions.
It needs to be emphasized that this paper presents preliminary considerations rather
than sustainable outcomes. On the contrary, each of the presented contemplations
could be the subject of an empirical evaluation of its own (e.g. whether or not
advantageous formatting really has a positive effect on comprehensibility). Also,
decisions need to be made about how to execute the test (e.g. how textual and visual
expressions are shown to the testers, whether they can use zooming or layouting
functions, etc.). The authors plan to pursue the considerations presented here and,
ultimately, to come up with a test design. Getting there will require many (self-)tests
before a test design is finally found that is capable of assessing the specified
hypothesis reliably.
Acknowledgement
The authors thank the anonymous reviewers for their patience with the tentativeness of
these contemplations and for their productive comments, which have helped to further
advance the test design.
References
[1] Bortz, J., Döring, N., Forschungsmethoden und Evaluation für Sozialwissenschaftler
(Research Methods and Evaluation for Social Scientists), Springer, 1995
[2] Frakes, W.B., Baeza-Yates, R., Information Retrieval: Data Structures and Algorithms,
Prentice-Hall, 1992
[3] Krebs, R., Die wichtigsten Regeln zum Verfassen guter Multiple-Choice Fragen (Most
Important Rules for Writing Good Multiple-Choice Questions), IAWF, Bern, 1997
[4] Popper, K., Logik der Forschung, 1934 (The Logic of Scientific Discovery, 1959)
[5] Prechelt, L., Kontrollierte Experimente in der Softwaretechnik (Controlled Experiments in
Software Engineering), Springer, 2001
[6] Ricca, F., Di Penta, M., Torchiano, M., Tonella, P., Ceccato, M., The Role of Experience
and Ability in Comprehension Tasks supported by UML Stereotypes, Proc. of ICSE'07,
IEEE, pp. 375-384
[7] Scanlan, D.A., Structured Flowcharts Outperform Pseudocode: An Experimental
Comparison, IEEE Software, Vol. 6(5), September 1989, pp. 28-36
[8] Shneiderman, B., Mayer, R., McKay, D., Heller, P., Experimental investigations of the
utility of detailed flowcharts in programming, Communications of the ACM, Vol. 20(6),
1977, pp. 373-381
[9] Sun, Code Conventions for the Java Programming Language, April 20, 1999,
http://java.sun.com/docs/codeconv/
[10] Wohlin, C., Runeson, P., Höst, M., Ohlsson, M., Regnell, B., Wesslén, A.,
Experimentation in Software Engineering - An Introduction, Kluwer, 2000