=Paper= {{Paper |id=Vol-1522/Cuenca2015HuFaMo |storemode=property |title=Empirical Study: Comparing Hasselt with C# to Describe Multimodal Dialogs |pdfUrl=https://ceur-ws.org/Vol-1522/Cuenca2015HuFaMo.pdf |volume=Vol-1522 |dblpUrl=https://dblp.org/rec/conf/models/CuencaBLC15 }} ==Empirical Study: Comparing Hasselt with C# to Describe Multimodal Dialogs== https://ceur-ws.org/Vol-1522/Cuenca2015HuFaMo.pdf
     Empirical Study: Comparing Hasselt with C# to Describe Multimodal Dialogs
                               Fredy Cuenca, Jan Van den Bergh, Kris Luyten, Karin Coninx
                                               Hasselt University - tUL - iMinds
                                               Expertise Centre for Digital Media
                                             Wetenschapspark 2, 3590 Diepenbeek
                                                            Belgium
                        Email: {fredy.cuencalucero,jan.vandenbergh,kris.luyten,karin.coninx}@uhasselt.be


   Abstract—Previous research has proposed guidelines for creating domain-specific languages for modeling human-machine multimodal dialogs. One of these guidelines suggests the use of multiple levels of abstraction so that the descriptions of multimodal events can be separated from the human-machine dialog model. In line with this guideline, we implemented Hasselt, a domain-specific language that combines textual and visual models, each of them aiming at describing different aspects of the intended dialog system.
   We conducted a user study to measure whether the proposed language provides benefits over equivalent event-callback code. During the user study participants had to modify the Hasselt models and the equivalent C# code. The completion times obtained for C# were on average shorter, although the difference was not statistically significant. Subjective responses were collected using standardized questionnaires and an interview, which both indicated that participants saw value in the proposed models. We provide possible explanations for the results and discuss some lessons learned regarding the design of the empirical study.
   Index Terms—Multimodal systems, Human-machine dialog, Finite state machines, Dialog model, Domain-specific language.

                       I. INTRODUCTION

   Multimodal systems allow users to communicate through the coordinated use of multiple input modes, e.g. speech, gaze, and gestures. These systems have the potential to support human-machine communication that is robust (e.g. multiple inputs can be combined to perform disambiguation), flexible (e.g. users can choose their preferred modality), and more natural than ever before.
   However, implementing multimodal systems is still a difficult task. This is partly because of the complexity of multimodal interaction [1], [2], the absence of a standardized methodology [2], and the mastery of different state-of-the-art technologies required for their construction [3].
   Several domain-specific languages have been proposed with the intention of simplifying the implementation of multimodal interfaces [3]–[9]. From an analysis of these languages, Dumas et al. [3] proposed several guidelines for developing future languages. One of these guidelines states that the specialized language must be such that the declaration of multimodal events can be separated from the description of the human-machine dialog (Figure 1). The present research has implemented this idea and measured how potential users can benefit from such an implementation.
   Concretely, we created a language, called Hasselt, that provides notations for declaring multimodal events and human-machine dialogs separately. The multimodal events are textually declared as combinations of predefined user events (e.g. mouse clicks, speech inputs, etc.). The multimodal dialog is depicted as a finite state machine (FSM) whose arcs are labelled with multimodal event names.
   In order to evaluate the benefits of such separation of concerns, a user study was conducted. Participants had to sequentially modify two equivalent implementations of a multimodal dialog system. In one case, both the code for handling the events and the code for handling the dialog were included in the same source file written in C#. In the other case, these were specified separately with the textual and visual notations provided by Hasselt.

                      II. RELATED WORK

A. Modeling multimodal dialogs as FSMs

   When modeling human-machine dialogs as finite state machines (FSMs), the nodes of the FSM represent the possible states of the dialog system, and its arcs represent the transitions in the dialog system's state. Many researchers have proposed FSM-based solutions for modeling unimodal human-machine dialogs, e.g. IOG [10], SwingStates [11], Schwarz's framework [12], and InterState [13], among others.
   However, there are only a few languages that allow modeling multimodal human-machine dialogs as FSMs. Some representative examples are listed in what follows.
   We can consider MEngine [4] as an FSM-based language that allows modeling trivial multimodal dialogs, i.e. dialogs where the system responses are always the same for a given multimodal input.
   In NiMMiT [14], the dialog model is a state machine where each state represents a set of tasks that are available to the end user. NiMMiT is restricted to interactive virtual environments (IVEs) since its presentation model has to be encoded in VRIXML [15]. In contrast, with our proposal, the presentation model can be implemented in any .NET language, which opens a wide assortment of possibilities beyond IVEs.




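The FSM reading of a dialog used by these languages (nodes as dialog states, arcs as event-driven transitions) can be illustrated with a minimal sketch. The sketch below is in Python purely for illustration; the state and event names are hypothetical and are not taken from any of the cited tools:

```python
# Minimal finite-state dialog model: nodes are dialog states and user
# events drive the transitions. All state/event names are illustrative.
class DialogFSM:
    def __init__(self, initial, transitions):
        # transitions maps (current_state, event_name) -> next_state
        self.state = initial
        self.transitions = transitions

    def handle(self, event):
        # Events that are not valid in the current state are ignored,
        # i.e. the machine stays where it is.
        self.state = self.transitions.get((self.state, event), self.state)
        return self.state

# A toy two-state dialog: empty canvas vs. canvas with objects.
fsm = DialogFSM("empty", {
    ("empty", "createObject"): "hasObjects",
    ("hasObjects", "removeObjects"): "empty",
})
fsm.handle("createObject")   # -> "hasObjects"
fsm.handle("removeObjects")  # -> "empty"
```

Note how the table of transitions doubles as a specification of which commands are available in which state, which is the property the dialog-modeling languages above exploit.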
Fig. 1. On the left side of the diagram, one can see the different levels of abstraction proposed by Dumas et al. [3]. On the right side, one can see how our language follows the same framework: our visual language is at the dialog level whereas our textual notations are at the events level.

   SMUIML [3] provides different notations for declaring the human-machine dialog and for combining user events. Unlike Hasselt, SMUIML does not include a symbol for defining iterative events. This reduces the space of multimodal events that can be specified with SMUIML in comparison with Hasselt. For instance, drag-and-drop, which involves an arbitrary number of mouse-move events, cannot be specified with SMUIML at the level of events. Another difference is that, unlike Hasselt, SMUIML does not support state variables or conditional transitions at the dialog level.

B. User studies: Interaction models vs. event-callback code

   To the best of our knowledge, none of the abovementioned multimodal dialog modeling languages has been evaluated in user studies. Nonetheless, outside the multimodal domain, we found two user studies that guided us in the design of our experiments.
   Oney et al. recruited 20 developers to evaluate the understandability of InterState's visual notation. Each participant had to modify two systems (drag-and-drop and a thumbnail viewer) implemented in both RaphaelJS1 and InterState. It was verified that InterState models are faster to modify than equivalent event-callback code written in RaphaelJS [13].
   The creators of Proton++ carried out two experiments with 12 programmers. Each participant was shown a multitouch gesture specification and a set of videos of a user performing gestures. Gestures could be specified as a regular expression, as a tablature, or with event-callback code, and the participant had to match the specification with the video showing the described gesture. The results showed that the tablatures of Proton++ are easier to comprehend than equivalent regular expressions and event-callback code [16].
   Since real-world scenarios require programmers not only to comprehend but also to write programming code, we followed the schema of Oney et al. [13]. We asked participants to perform modifications with our language and with equivalent event-callback code.

  1 http://raphaeljs.com/

                        III. HASSELT

   Hasselt provides notations for creating executable specifications of multimodal human-machine dialogs. It comes with a complete User Interface Management System (UIMS) [17] that offers the editors, runtime environment, and debugging tools required to code, run, and test Hasselt specifications.

A. Running Example

   In the remainder of the paper, we will show how to implement a simple multimodal dialog system with Hasselt. The front-end of our running example system is shown in Figure 3, BE. It allows end users to issue multimodal commands to create, move, and remove objects from a canvas that is initially empty. These commands may be enabled or disabled depending on the current context-of-use.
   Users can create new objects by issuing voice commands like ‘create green box here’ while clicking on the canvas to indicate the position of the new object. Boxes are reshuffled by issuing ‘put that there’ while clicking on both the target object and its new position [18]. And the canvas can be cleared in reaction to the voice command ‘remove objects’. To make the system responses depend on the context-of-use, we added two rules: the boxes can only be moved if there are more than three of them on the canvas, and the canvas can only be cleared after the displacement of at least one object.

B. How to use Hasselt UIMS?

   The steps required to create a multimodal dialog system with Hasselt UIMS are as follows.
   1) Implementing a back-end application: One must create an executable program implementing the front-end and the handling methods of the intended system. For the purpose of this work, such a program will be referred to as the back-end application. The back-end application can be implemented in any .NET programming language and is subsequently imported into Hasselt UIMS.
   For the aforementioned running example, the back-end application implements the front-end shown in Figure 3, BE, and the methods for creating, moving, and removing virtual objects, i.e. CreateObject(color, x, y), PutThatThere(x1, y1, x2, y2), and RemoveAllObjects().
   2) Declaring multimodal events: Hasselt allows combining multiple user events into one single abstraction [9], [19].
   Programmatically, user events can be combined through a set of event operators that can be used in a recursive manner. The operator FOLLOWED BY (;) indicates sequentiality of events, the operator OR (|) serves to specify alternative events, AND (+) represents simultaneity of events, and ITERATION (*) is meant to specify repetitive events [9].
   To describe the interactions to be supported by our running example system, we used these operators to declare the following multimodal events (Figure 2, a):




Fig. 2. Hasselt UIMS during design time. (a) Textual editor for declaring multimodal events. It offers syntax highlighting, auto-completion popups, tooltip messages, and other features that facilitate the editing of code. (b) Visual editor for depicting human-machine dialogs. The arrows of the FSM are annotated with the multimodal events declared in (a). The arrows can include guard conditions.


   event putThatThere = speech.put ;
                        speech.that + mouse.down⟨x1, y1⟩ ;
                        speech.there + mouse.down⟨x2, y2⟩          (1)

   event createObject = speech.create ;
                        speech.any⟨color⟩ ;
                        speech.here + mouse.down⟨x, y⟩             (2)

   event removeObjects = speech.remove ;
                         speech.objects                            (3)
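To make the operators concrete, the sketch below gives one possible regex-like reading of them in Python. This is our own simplified semantics for illustration (events arrive as “frames” of simultaneous primitives, and a matcher consumes frames); it is not Hasselt's implementation:

```python
# Regex-like combinators over a timeline of "frames", where each frame
# is the set of primitive events detected at the same instant.
# Illustrative semantics only, not Hasselt's actual event engine.

def ev(name):
    # Match one frame that contains the named primitive event.
    def m(frames):
        if frames and name in frames[0]:
            return frames[1:]
        return None
    return m

def seq(*parts):            # FOLLOWED BY (';'): parts in order
    def m(frames):
        for p in parts:
            frames = p(frames)
            if frames is None:
                return None
        return frames
    return m

def alt(*parts):            # OR ('|'): first alternative that matches
    def m(frames):
        for p in parts:
            rest = p(frames)
            if rest is not None:
                return rest
        return None
    return m

def both(a, b):             # AND ('+'): both events in the same frame
    def m(frames):
        if frames and a(frames) is not None and b(frames) is not None:
            return frames[1:]
        return None
    return m

def star(p):                # ITERATION ('*'): zero or more repetitions
    def m(frames):
        while True:
            rest = p(frames)
            if rest is None:
                return frames
            frames = rest
    return m

# 'put that there' with a click on the last two words, as in Equation 1:
put_that_there = seq(ev("speech.put"),
                     both(ev("speech.that"), ev("mouse.down")),
                     both(ev("speech.there"), ev("mouse.down")))

timeline = [{"speech.put"},
            {"speech.that", "mouse.down"},
            {"speech.there", "mouse.down"}]
assert put_that_there(timeline) == []   # fully consumed: event detected
```

Under this reading, a drag-and-drop would be seq(ev("mouse.down"), star(ev("mouse.move")), ev("mouse.up")), which needs the iteration operator; this is the expressiveness gap with SMUIML noted in the related work.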



   3) Binding multimodal events with event-handling callbacks: Each multimodal event must be bound to a method of the back-end application. At runtime, Hasselt UIMS will automatically launch these methods whenever their associated multimodal events are detected.
   For our running example, one has to bind the method PutThatThere(x1, y1, x2, y2) to the event putThatThere shown in Equation 1. Similarly, the methods CreateObject(color, x, y) and RemoveAllObjects() can be bound to the events declared in Equation 2 and Equation 3 respectively. The multimodal events do not have to have the same name as their associated callbacks.
   We must highlight that the binding between multimodal events and callback functions is specified through a textual notation. With this notation, one can not only bind multiple callbacks to one single multimodal event, but also specify temporal and spatial constraints among the constituents of a multimodal event. The notations for binding multimodal events will not be presented herein; interested readers can refer to [19]. The focus of this paper is on the evaluation of the visual language that is used after the definition and binding of multimodal events.

Fig. 3. Hasselt UIMS during runtime: The event viewer (EV) displays the user events as detected by the recognizers. The variable browser (VB) shows the event parameter values. The automata view (AV) presents animations showing the progressive detection of multimodal events. The back-end application (BE) the end user has to interact with was imported into Hasselt UIMS.

   4) Describing the human-machine dialog: The visual editor provided by Hasselt UIMS (Figure 2, b) enables programmers to describe human-machine dialogs as extended finite state machines [20], i.e. state machines augmented with state variables and guard conditions.
   In a Hasselt visual model, the circles represent the potential states of the dialog system, and the arcs represent the system's state transitions. Each arc is annotated with a multimodal event whose occurrence causes the transition represented by the arc. Additionally, one can use state variables to encode quantitative aspects of the dialog, e.g. the number of times a state (transition) is visited (traversed). The statements required to maintain the state variables can be annotated on the arcs of the extended state machine. Finally, guard conditions can also be annotated on the arcs of an FSM to restrict their associated state transitions.
   The visual model shown in Figure 2, b describes the dialog supported by our running example system. The circle labelled




as 1 represents the state where the canvas is empty; the circle 2 represents the state where there is at least one object on the canvas; and the circle 3, the state where at least one object has been moved. The system moves from the initial state 1 to state 2 upon the creation of the first object. It also moves from state 2 to state 3 after the first displacement of an object. The variable N is used (a) to count the number of objects on the canvas –when this is relevant– and (b) to condition the displacement of objects, which should only be possible if there are more than 3 objects on the canvas –notice the label [N > 3]. Finally, the removal of objects sets the system back to its initial state: the circle labelled as 1.
   If event-callback code were used to implement the running example system, identifying the system's state would require a series of nested if-else statements spread throughout a large portion of the program. In contrast, Hasselt models have fewer and simpler conditional clauses that can be centralized in an FSM that provides a comprehensive overview of the human-machine dialog. Whether these theoretical advantages translate into practical benefits for programmers can only be determined through a user study.

                       IV. USER STUDY

   The experiment aims at determining whether separating the declaration of events from the dialog model brings about benefits for programmers.

A. Hypothesis

   We hypothesize that the maintenance of a multimodal dialog can be performed faster and/or more easily with Hasselt, where the events can be described separately from the dialog model, than with C#, where the code for combining multimodal events is intermixed with the code for dialog management.

B. Method

   1) Study Design: The participants were evaluated one by one after receiving a training session.
   During the experiment, each participant was shown a multimodal system with which he had to interact according to the indications of the researcher. Once the participant was familiar with the functionality of the system, he was shown the source code/visual model of the system and asked to perform modifications to it. Each participant had to sequentially perform the changes in both Hasselt and C#. The changes to be performed were explained orally, but also written on a sheet that the participant could check during the experiment.
   While the participant modifies the code/visual model, the researcher observes the changes made by the participant on a secondary monitor that replicates the screen in front of the participant. In this way, for each language, the researcher can measure the completion time of the task, count how many times the partial changes are tested in the runtime environment, and watch how the participant navigates through the C# code or Hasselt visual model.
   After the participant performs the requested changes with a language, he is asked to fill in a post-task questionnaire. At the end of the whole experiment, i.e. after using Hasselt and C#, the participant is asked to evaluate the usability of Hasselt UIMS and is interviewed by the researcher.
   2) Participants: We recruited 12 participants, all of whom are male. The programming experience of the participants ranges from 4 to 13 years; their C# experience, between 1 and 8 years (Figure 4).

Fig. 4. Programming experience of the 12 participants.

   3) Procedure: Before the beginning of the experiment, each participant was given a 10-minute tutorial about Hasselt. Participants had to describe a simple, Hello-world-like multimodal interaction by following step-by-step instructions. The tutorial helped participants get acquainted with the visual editor, debugging tools, and runtime environment of Hasselt UIMS. Since all participants had experience with C# and MS Visual Studio, there was no need for training in this respect.
   For the experiment, the participant was presented with a system similar to the one herein used as a running example. It allowed users to create and remove virtual objects from a canvas in response to multimodal input. In the version given to participants, the objects could be created or removed at any time, after which the end user was acknowledged with voice feedback. Participants were asked to change the system so that it could handle two contexts-of-use: the command to remove objects must only be processed if there are objects on the screen; otherwise, it should be ignored.
   The aforementioned system was described with both C# and Hasselt. Each participant had to modify both sources within a time limit of 30 minutes per language.
   4) Solution of the modeling task: With Hasselt, the required changes can be made by modifying the human-machine dialog model only. Participants had to define different contexts-of-use to distinguish whether the form is empty or has objects on it. Figure 5 shows two potential solutions.
   As to the C# code, participants had to declare one variable for counting the number of objects on the form. This variable has to be updated every time a new object is created and whenever all the objects are removed from the form. It also has to be interrogated before proceeding to clear the form. Although these four additions are easy to implement, they have to be included in the right place of a source code of 114 lines.




(a) Model given to participants. (b) Most common solution. (c) Outlier's solution.
Fig. 5. The model shown in (a) was given to participants. Here the system is always in the same context-of-use and any interaction is available at any time. The model (b) was presented as a final solution by 11 participants. The model (c) was the final solution found by the outlier, who had no previous experience with FSMs. Both types of solutions were correct.

   (Actually, the full code contained 273 lines, but we hid the code for loading the speech recognizer, for hooking the mouse, and the back-end functions. This was to make the comparison as fair as possible. With Hasselt, the configuration code and the back-end code cannot be seen either: the former is within Hasselt UIMS; the latter, in a canned application imported into Hasselt UIMS.)

C. Measures

   1) Observations: As the participant performs the required modifications with a certain language, the researcher monitors his working time and counts the number of times the code is tested.
   2) Single Ease Question (SEQ) questionnaire: Right after completing the changes with each language, participants were asked to fill in the Single Ease Question (SEQ) questionnaire, a 7-point rating scale (Figure 6) aimed at assessing the perceived difficulty (or perceived ease, depending on one's perspective) of a task. The questionnaire has been proven to be reliable, sensitive, and valid while also being easy to respond to and easy to score [21].

Fig. 6. Single Ease Question (SEQ) questionnaire.

   3) System Usability Scale (SUS) questionnaire: At the end of the experiment, participants had to fill in the System Usability Scale (SUS) questionnaire [22] (Figure 7, a), which has become a well-known questionnaire for end-of-test subjective assessments of usability [23].
   The SUS questionnaire consists of 10 items with 5-point scales numbered from 1 (anchored with “Strongly disagree”) to 5 (anchored with “Strongly agree”).
   SUS test scores are normalized to values between 0 and 100. To provide a benchmark against which SUS scores can be compared, Lewis et al. shared historical data showing that the average and third quartile of 324 usability evaluations performed with SUS are 62.1 and 75.0 respectively [23].
   Finally, according to a factor analysis performed by Lewis et al., the SUS questionnaire does not only measure usability. It also measures learnability, with Q4 and Q10 being the questions that allow estimating the perceived learnability of the system under evaluation [23]. In the taxonomy proposed by Grossman et al. [24], this learnability falls within the category of initial learnability, given that participants were exposed to Hasselt for the first time during this experiment.

D. Interview Highlights

   Based on the SEQ scores, a majority (7 out of 12 participants) considered that the modification of Hasselt visual models was easier than changing C# code. When asked for a reason, many of these participants referred to the overall view provided by the visual models: “You can see all the system in one screen” and “You do not have to browse code through multiple screens” were common answers.
   One of the few participants who scored Hasselt as more difficult than C# was the outlier seen in Figure 8, a. He pointed out his total lack of knowledge of state machines as the cause of his poor performance. All other participants had, at least, pen-and-paper experience with state machines and thus could get more benefit from the training session.

E. Results

   All 12 participants could complete the changes with both languages, Hasselt and C#. The data from observations and post-task questionnaires are synthesized in Figure 8. After inspecting the data, we decided to drop the only participant who had no previous experience with FSMs. He was an outlier in the plots (a) and (b) shown in Figure 8. Therefore, the following results are based on the remaining 11 participants.
   1) Completion time: On average, changes made with Hasselt took 2.4 minutes in comparison with the 2.1 minutes when using C#. However, these results were not statistically significant. We could not reject the null hypothesis in favor




               (a) System Usability Scale (SUS) questionnaire                           (b) Scores per question for Hasselt UIMS
Fig. 7. (a) SUS questionnaire that was filled by the 12 participants to evaluate Hasselt UIMS. (b) Participants’ responses to the SUS questionnaire. Stacked
barplots show the frequency of answers per question. The numbers at the right of the barplots indicate the average score per question.



of the alternative hypothesis that Hasselt completion times are higher than C# completion times: a Wilcoxon signed-rank test resulted in p-value = 0.1562 > 0.05 (W = 12.5, Z = 1.3828).

   2) Code testing effort: On average, programmers tested their code 1.2 times when using Hasselt and 1.4 times when using C#. This result is not statistically significant either. We could not reject the null hypothesis in favor of the alternative hypothesis that the code testing effort is lower with Hasselt than with C#: a Wilcoxon signed-rank test resulted in p-value = 0.25 (W = 0, Z = -1.4142).

   3) Perceived ease of the task: The average SEQ scores for Hasselt and C# were 6.6 and 5.9 respectively. In this case, the difference in favor of Hasselt was statistically significant: a Wilcoxon signed-rank test indicated that the alternative hypothesis that the SEQ scores are higher for Hasselt than for C# can be accepted (p-value = 0.0078, W = 28, Z = 2.6153).

   Note: We used Wilcoxon signed-rank tests instead of paired t-tests because we could not guarantee the normality assumption required by the latter; the non-normality of the pair differences was observed in both normal Q-Q plots and Shapiro-Wilk normality tests. The data analysis was performed with the open-source software R (https://www.r-project.org/).

   4) Results of the SUS questionnaire: The SUS questionnaire was used only to evaluate Hasselt UIMS. Compared against the data repository provided by Lewis et al., the average SUS score of 73.96 that the participants gave to Hasselt UIMS indicates that its perceived usability is well above average, but not higher than 75% of the 324 systems reported in [23]. The average scores obtained by Hasselt UIMS on each of the 10 items of the SUS questionnaire are shown in Figure 7, b.

F. Threats to validity

   1) Construct validity: The general concept of validity was traditionally defined as the degree to which a test measures what it claims, or purports, to be measuring [25]. The construct validity of our empirical study could have been affected as follows.

   First, the code testing effort was quantified as the number of times a participant entered the runtime environment; that is, we assumed that participants have to run the program in order to test the correctness of the source code. This definition may not be complete, since it ignores the effort made when a participant ‘runs and tests the code inside his head’.

   Second, the SUS questionnaire may have measured only certain aspects of the usability of Hasselt UIMS. An expert in empirical studies pointed out to us that usability also includes the long-term experience of using a software system, which is not considered in our study: all participants used Hasselt for the first and only time during the study. However, initial learnability, which is another dimension of the SUS questionnaire, was correctly measured by Q4 and Q10, according to the same expert.

   Construct validity is not the only type of validity that must be considered when designing empirical research. An empirical study is said to have internal validity when the impact of almost all influencing factors is excluded, so that the study is performed in a highly controlled setting [26]. In contrast, external validity consists of allowing some influencing factors so that the experiment can emulate a real-world situation instead of an ideal one [26]. Whereas external validity increases the chances that results can be generalized to more realistic, everyday situations, internal validity allows researchers to pinpoint the reasons for improvement or degradation, but at the cost of generalizability.

   2) Internal validity: We pursued internal validity in the following ways.

   First, the order in which the languages were used (i.e. Hasselt or C# first) was balanced over the participants so that the aggregated experience bias could be neutralized.

   Besides, since the goal of the experiment was to measure the effort of describing multimodal dialogs, participants were




Fig. 8. Data collected from the 12 participants. (a) Completion times. (b) Number of times the code was tested. The plot whiskers are at the lowest datum still within 1.5 times the interquartile range (IQR) of the lower quartile and the highest datum still within 1.5× IQR of the upper quartile. (c) Bar plots showing the frequency of each answer to the SEQ questionnaire.
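The W statistics reported for the data in Figure 8 are one-sided Wilcoxon signed-rank statistics: the sum of the ranks of the positive pair differences, with zero differences dropped and tied absolute differences sharing their average rank. The paper’s analysis was carried out in R; the sketch below only illustrates how W is computed, and the paired completion times in the example are hypothetical placeholders, not the study’s raw data.

```python
# Illustrative computation of the one-sided Wilcoxon signed-rank
# statistic W (sum of ranks of positive differences).
def wilcoxon_w(xs, ys):
    """xs, ys: paired measurements for the same participants."""
    diffs = [x - y for x, y in zip(xs, ys) if x != y]  # drop zero pairs
    ranked = sorted(diffs, key=abs)
    # Assign ranks 1..n to |differences|, averaging ranks over ties.
    ranks = {}
    i = 0
    while i < len(ranked):
        j = i
        while j < len(ranked) and abs(ranked[j]) == abs(ranked[i]):
            j += 1
        avg = (i + 1 + j) / 2          # mean of ranks i+1 .. j
        for k in range(i, j):
            ranks.setdefault(abs(ranked[k]), avg)
        i = j
    return sum(ranks[abs(d)] for d in diffs if d > 0)

# Hypothetical paired completion times (minutes) for four participants.
hasselt = [2.4, 2.9, 1.8, 3.1]
csharp = [2.1, 2.4, 1.9, 2.6]
print(wilcoxon_w(hasselt, csharp))  # prints 9.0
```

In a full test, W would then be compared against the null distribution (or normalized to the Z value reported in the text) to obtain the p-value; statistical packages such as R’s wilcox.test do this directly.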



restricted to this portion of the code/model only. With Hasselt, programmers were restricted to using the visual editor only. With C#, the code for configuring the speech recognizer and the application code (e.g. for creating and deleting objects) was hidden from programmers: we put this portion of the code in regions that were collapsed during the experiment.

   On the other hand, offering participants a tutorial on Hasselt but no tutorial on creating multimodal dialogs with C# might affect the experiment’s internal validity.

   3) External validity: In order to confer high external validity on our results, we allowed some ‘freedom’ in the experiments.

   First, the pool of participants was quite varied. It included master and PhD students, post-docs, and industry programmers, from different universities and countries, with and without a background in finite state machines (FSMs).

   Most importantly, participants were left ‘free in the wild’. This contrasts with other approaches commonly used in empirical studies, such as the think-aloud protocol and the question-suggestion protocol [24]. The former would require participants to speak aloud while programming in order to provide the researcher with insights into their programming logic; the latter would allow the researcher to give advice proactively to the participant. In our experiments, the researcher intervened only when participants asked questions. In our opinion, this is a more realistic scenario, reflecting the typical case of a programmer who works on his own and occasionally asks more expert programmers for advice when he gets stuck on a problem.

V. DISCUSSION AND CONCLUSION

A. Modeling with Hasselt and C#

   We presented Hasselt, a language that provides notations for defining multimodal human-machine interaction dialogs. A dialog model in Hasselt is an extended finite state machine, specified with a visual editor, whose arcs are annotated with multimodal events that are defined with a separate textual notation.

   We expected to experience some benefits from separating the event definition code from the dialog management model. However, this is what we found.

   First, the better-separated Hasselt models are not faster to modify than equivalent event-callback code in which the instructions for event handling and for dialog management are intermixed. Although for our participants the task of implementing a multimodal dialog was, on average, performed faster with C# than with Hasselt, these results were not statistically significant.

   Second, our participants tested Hasselt models fewer times than equivalent C# code. Despite this, completing changes with Hasselt took longer. Based on our observations, the reason may be that modifying visual models is more time-consuming than writing textual code.

   Finally, the SEQ questionnaires revealed that participants perceived performing the required changes with Hasselt to be easier than with C#. Although these measurements turned out to be statistically significant, we cannot discard the possibility that some response bias played a role here: participants gave higher scores to the language that led to longer completion times.

B. Perceived usability and initial learnability of Hasselt UIMS

   Considering that odd-numbered questions are positively worded, scores higher than 3 on these items reflect that participants agree (to a certain degree) that the evaluated system presents some good aspect or feature. In our study, all odd-numbered questions were scored with more than 3 points on average. From this group, Q3, i.e. “I thought the system was easy to use”, and Q7, i.e. “I would imagine that most people would learn to use this system very quickly”, received the highest scores.

   Similarly, since even-numbered items are negatively worded, scores lower than 3 indicate that participants disagree (to a certain degree) with some negative comment about the system. In our study, all even-numbered questions were scored with less than 3 points on average. From this group, Q10, i.e. “I needed to learn a lot of things before I could get going with this system”, Q4, i.e. “I think I would




need support of technical person to use this system”, and Q8, i.e. “I found the system very cumbersome to use”, received the lowest scores (which in this case is something positive).

   The salient scores obtained for Q4 and Q10, the two questions that define perceived initial learnability [23], indicate that, to a certain degree, participants consider Hasselt UIMS easy to learn.

C. Future work

   We think that the main reason why no clear winner emerged from this study is that the task was too simple given the programming experience of the participants. Thus, we plan to repeat the experiment with more complex tasks.

   Other, minor changes concern the functionalities of the visual editors. We want to minimize the effort involved in wiring the FSMs: we plan to add key combinations for creating nodes and links, to disallow resizing of the nodes, and to allow jumping between the elements of an FSM with the TAB key.

   Finally, we would like to gather objective cognitive load measurements [27] such as heart rate or pupil dilation. We expect to see some positive correlation between the perceived difficulty declared by participants in the questionnaires and their physiological reactions during the task.

D. Lessons learned

   Based on this experience, we suggest some guidelines for others trying to design comparative studies between domain-specific languages and a mainstream language.

   It is important that the training session be supervised by the researcher and carried out right before the test. This makes all participants start the experiment with a similar level of knowledge, as long as they have similar backgrounds. Otherwise, some participants can benefit more from the training than others, which may cause the appearance of outliers.

   It may not be a good idea to recruit programmers working in the same research lab; some may feel that their programming skills are going to be evaluated. From a research lab with more than 50 people, we could only recruit 5 participants; the remaining 7 participants were recruited from external institutions. Alternatively, one can ask a person from an external institution to play the role of researcher, so that participants do not feel observed by an acquaintance or colleague.

   The complexity of the programming task must be appropriately calibrated: it has to be high enough to reveal differences in the measurements, but not so high as to affect completion rates. In this matter, one must evaluate whether it is better to ask programmers to modify an existing program or to implement a new one from scratch.

REFERENCES

 [1] Y. A. Ameur and N. Kamel, “A generic formal specification of fusion of modalities in a multimodal HCI,” in Building the Information Society. Springer, 2004.
 [2] W. Dargie, A. Strunk, M. Winkler, B. Mrohs, S. Thakar, and W. Enkelmann, “A model based approach for developing adaptive multimodal interactive systems,” in ICSOFT (PL/DPS/KE/MUSE), 2007, pp. 73–79.
 [3] B. Dumas, D. Lalanne, and R. Ingold, “Description languages for multimodal interaction: A set of guidelines and its illustration with SMUIML,” Journal on Multimodal User Interfaces, vol. 3, no. 3, pp. 237–247, 2010.
 [4] M. Bourguet, “Designing and prototyping multimodal commands,” in Proceedings of INTERACT’03, 2003, pp. 717–720.
 [5] P. Dragicevic and J.-D. Fekete, “Support for input adaptability in the ICon toolkit,” in Proceedings of the 6th ICMI’04. New York, NY, USA: ACM, 2004, pp. 212–219. [Online]. Available: http://doi.acm.org/10.1145/1027933.1027969
 [6] J. De Boeck, D. Vanacken, C. Raymaekers, and K. Coninx, “High level modeling of multimodal interaction techniques using NiMMiT,” Journal of Virtual Reality and Broadcasting, vol. 4, no. 2, 2007.
 [7] W. A. König, R. Rädle, and H. Reiterer, “Interactive design of multimodal user interfaces,” Journal on Multimodal User Interfaces, vol. 3, no. 3, pp. 197–213, 2010.
 [8] J.-Y. L. Lawson, A.-A. Al-Akkad, J. Vanderdonckt, and B. Macq, “An open source workbench for prototyping multimodal interactions based on off-the-shelf heterogeneous components,” in Proceedings of EICS’09. ACM, 2009, pp. 245–254.
 [9] F. Cuenca, J. Van den Bergh, K. Luyten, and K. Coninx, “A domain-specific textual language for rapid prototyping of multimodal interactive systems,” in Proceedings of the 6th ACM SIGCHI Symposium on Engineering Interactive Computing Systems (EICS’14). ACM, 2014.
[10] D. A. Carr, “Specification of interface interaction objects,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 1994, pp. 372–378.
[11] C. Appert and M. Beaudouin-Lafon, “SwingStates: Adding state machines to Java and the Swing toolkit,” Software: Practice and Experience, vol. 38, no. 11, pp. 1149–1182, 2008.
[12] J. Schwarz, J. Mankoff, and S. Hudson, “Monte Carlo methods for managing interactive state, action and feedback under uncertainty,” in Proceedings of the 24th Annual ACM Symposium on UIST. ACM, 2011, pp. 235–244.
[13] S. Oney, B. Myers, and J. Brandt, “InterState: Interaction-oriented language primitives for expressing GUI behavior,” in Proceedings of UIST’14. ACM, 2014.
[14] J. De Boeck, C. Raymaekers, and K. Coninx, “A tool supporting model based user interface design in 3D virtual environments,” in GRAPP 2008: Proceedings of the Third International Conference on Computer Graphics Theory and Applications, 2008, pp. 367–375.
[15] E. Cuppens, C. Raymaekers, and K. Coninx, “VRIXML: A user interface description language for virtual environments,” 2004.
[16] K. Kin, B. Hartmann, T. DeRose, and M. Agrawala, “Proton++: A customizable declarative multitouch framework,” in Proceedings of the 25th Annual ACM Symposium on User Interface Software and Technology (UIST’12), 2012, pp. 477–486.
[17] M. Beaudouin-Lafon, “User interface management systems: Present and future,” in From Object Modelling to Advanced Visual Communication. Springer, 1994, pp. 197–223.
[18] R. Bolt, “Put-that-there: Voice and gesture at the graphics interface,” in Proceedings of the 7th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH’80). ACM, 1980.
[19] F. Cuenca, J. Van den Bergh, K. Luyten, and K. Coninx, “Hasselt UIMS: A tool for describing multimodal interactions with composite events,” in Proceedings of EICS’15, 2015.
[20] V. S. Alagar and K. Periyasamy, Specification of Software Systems. Springer Science & Business Media, 2011.
[21] J. Sauro and J. S. Dumas, “Comparison of three one-question, post-task usability questionnaires,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2009, pp. 1599–1608.
[22] J. Brooke, “SUS: A quick and dirty usability scale,” Usability Evaluation in Industry, vol. 189, no. 194, pp. 4–7, 1996.
[23] J. R. Lewis and J. Sauro, “The factor structure of the System Usability Scale,” in Human Centered Design. Springer, 2009, pp. 94–103.
[24] T. Grossman, G. Fitzmaurice, and R. Attar, “A survey of software learnability: Metrics, methodologies and guidelines,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2009, pp. 649–658.
[25] J. D. Brown, The Elements of Language Curriculum: A Systematic Approach to Program Development. ERIC, 1995.
[26] J. Siegmund, N. Siegmund, and S. Apel, “Views on internal and external validity in empirical software engineering,” in Proceedings of the 37th International Conference on Software Engineering (ICSE 2015), 2015.
[27] R. Brunken, J. L. Plass, and D. Leutner, “Direct measurement of cognitive load in multimedia learning,” Educational Psychologist, vol. 38, no. 1, pp. 53–61, 2003.



