Comparing the Usability of two Multi-Agents
       Systems DSLs: SEA_ML++ and DSML4MAS
                      Study Design
                           João Silva∗ , Ankica Barišić∗ , Vasco Amaral∗ , Miguel Goulão∗ ,
                Baris Tekin Tezel† , Omer Faruk Alaca‡ , Moharram Challenger‡ , and Geylani Kardas‡
                          ∗ Universidade NOVA de Lisboa, NOVA LINCS, DI, FCT, Lisboa, Portugal
                                             † Dokuz Eylul University, Izmir, Turkey
                                ‡ Ege University, International Computer Institute, Izmir, Turkey

                           Email: (ji.silva | a.barisic)@campus.fct.unl.pt, (vma | mgoul)@fct.unl.pt
            baris.tezel@deu.edu.tr, omerfarukalaca@gmail.com, (moharram.challenger | geylani.kardas)@ege.edu.tr


   Abstract. Context: The “Physics of Notations” (PoN) supports a systematic improvement
of the cognitive effectiveness of visual modelling languages. Problem: PoN focuses on the
concrete syntax of a language, building on a predefined abstract syntax. We should also
consider the abstract syntax of a language when developing efforts to improve it by
choosing the most adequate language constructs (concepts and their relationships). We
instantiate this challenge by comparing two Multi-Agent Systems Domain Specific Languages,
SEA_ML++ and DSML4MAS, and assessing the extent to which their respective constructs
affect the developer experience. Method: We will perform a quasi-experiment for comparing
how practitioners use both languages to solve similar modelling challenges. The experiment
will have a cross-over within-subjects design and will focus on the extent to which the
different language constructs impact on developer experience. These tasks will be
monitored, so that we can assess their success and the effort involved, including
eye-tracking information. Results: This paper reports on the planned study design for this
empirical comparison of two DSLs for MAS.

                      I. INTRODUCTION

   In the last two decades, technologies like modelling workbenches made it easier to
design, prototype and deploy diagrammatic languages, which are used more often for
capturing abstractions in modelling. Extensive experience with the development of
domain-specific languages (DSLs) led to a new discipline, Software Languages Engineering
(SLE), with the goal of making systematic the process of developing a software language.
   SLE follows an iterative life-cycle [1], [2] that starts with domain analysis, followed
by language design, implementation and evaluation. Unfortunately, the first and the last
steps are still not at a mature phase. Besides taking into account the evaluation of
expressiveness of a given language, the language design (coverage of the language goals)
needs to make use of empirical studies to assess the language usability.
   The “Physics of Notations” (PoN) [3] created a valuable framework to evaluate the
language’s concrete syntax, and is extensively used to support a systematic improvement of
the cognitive effectiveness of visual modelling languages with a fixed abstract syntax in
some language metamodel or grammar.
   Improving the concrete syntax is very important, but we should also consider the
abstract syntax of a language. We should be able to choose and validate the adequate
language constructs (concepts and their relationships) and the models (or language
sentences) we can express with those. However, there is a lack of guidelines reported in
the literature for this new level of assessment that could give a languages engineer a
“recipe” for doing this sort of evaluation. These would be valuable when developing and
improving a given language.
   We are currently improving a Multi-Agent Systems (MAS) DSL: SEA_ML++ [4], [5]. In this
context, we are planning an empirical comparison with another MAS DSL, DSML4MAS [6], to
assess the extent to which these DSLs’ respective constructs, and their combinations,
affect language usability.
   We will perform a quasi-experiment for comparing how practitioners use both languages to
solve similar modelling challenges. We will have a cross-over within-subjects design with a
focus on how the different language constructs impact on their usability by modellers. We
will monitor these tasks to assess their success and the effort involved, including
eye-tracking information and a usability questionnaire (SUS [7]).
   This paper is organised as follows: Section II describes DSMLs for MAS languages as the
object of our case study. Section III presents the planned quasi-experiment, followed by a
discussion in Section IV. Section V summarises this paper.

                      II. BACKGROUND

A. Multi-Agent Systems DSMLs
   Software agents are autonomous entities which contain intelligence that serves for
solving their selfish or common problems and to achieve certain goals. The study of
Multi-Agent Systems (MASs) focuses on those systems in which many intelligent agents
interact with each other. In agent-oriented software engineering (AOSE), the application of
model-driven development (MDD) and the use of domain-specific modelling languages (DSMLs)
for MAS development are quite popular since the implementation of MAS is naturally
complex, error-prone and costly due to the autonomous and
proactive properties of the agents [8].
   In the last decade, several MAS modelling languages and
DSMLs (e.g. [4], [6], [9], [10], [11], [12]) were proposed to
support development of MASs. For example, DSML4MAS [6]
introduces a general MAS metamodel with various viewpoints
that enable the development of MAS for many application
domains. A DSL is introduced in [10] to provide a lan-
guage for the development of mobile agents. In addition, [11]
introduces a modelling language enabling the model-driven
development within the scope of Prometheus methodology for
agent development. In [4] and [13] a graphical DSML (called
SEA_ML) and textual DSL (called SEA_L) are proposed
for MAS working in semantic web environments including 8
viewpoints. MAS-ML 2.0 [12] is a modelling language which
supports the MAS modelling with different agent architectures
such as: Simple Reflex Agents, Model-Based Reflex Agents,
Goal-Based Agents and Utility-Based Agents. DSML4BDI
[14] is another modelling language, specific to the Jason agent
programming language. In [15], the authors propose a tool-supported
development method that applies MDD techniques
to design and implement agents based on the belief-desire-intention
architecture with a sophisticated plan selection process.

B. The “Physics of Notations” (PoN)
   Moody proposed the “Physics of Notations” [3] to support the construction of more
effective software languages. A major concern is how to evaluate the cognitive
effectiveness of visual languages (see, for example, [16]). The framework concentrates on
the physical properties (concrete syntax) of the symbols and not on their structure
(abstract syntax) or semantics (ignoring the semantics of both the ontological and the
language target semantic domain). In Fig. 1, we present the dimensions at the instance
level (in grey) that are explored by the current work. Here we study the composition of
visual elements and their structure to form sentence instances. This figure is adapted from
[17], where the authors refer to their focus on the top left corner (Visual Notations).

Fig. 1. Dimensions in grey explored by the current work. Adapted from [17].

C. Related studies
   Most of the available DS(M)Ls proposed for MASs have been evaluated by just providing a
case study demonstrating how the related language can be used for design and implementation
of MAS. A quantitative analysis and/or qualitative evaluation considering e.g. the
development time performance, generation performance, and/or the usability of the language
are not considered in these studies.
   In [18], we proposed an evaluation framework which provides the systematic assessment of
both the language constructs and the use of agent DSMLs according to various dimensions and
criteria. The study also provides an assessment of SEA_ML [4]. However, it does not take
into account the effect of language constructs in the developer’s modelling process while
using the languages. This evaluation framework is adopted in [14], [19] and [15] for the
assessment of the proposed MAS DSMLs. Another MAS DSML evaluation exists in [20] for a
textual DSL, called JADEL, which provides four abstractions, namely agents, behaviours,
communication ontologies, and interaction protocols, to the well-known JADE agent
development framework. However, the study evaluates solely JADEL’s code generation
performance.
   In recent years, we have seen several studies to identify language improvement
opportunities, identifying problems with their concrete syntax and how they impact
developer experience. These studies have covered a diversity of languages, including UML
[16], [21], BPMN [22], [23], KAOS [24], [25], [26], i* [17], [27], [28], [29], OutSystems
BPT [30], and SEA_ML++ [5]. Some of these languages were also analysed from the perspective
of the impact of diagram layout on the understandability of models, namely UML [31], [32],
[33] and i* [34]. Other studies have compared alternative DSLs for a similar domain (e.g.
Lego Mindstorms vs. Gyro [35]).

                      III. EXPERIMENT PLANNING

   This section describes the experimental planning for this evaluation. Further details,
including documentation and evaluation materials, can be found in this paper’s companion
site (https://sites.google.com/fct.unl.pt/hufamo2018masstudydesign/home).

A. Goals
   Broadly, we are interested in assessing the usability of two MAS DSMLs, SEA_ML++ [4],
[5] and DSML4MAS [6], in the context of solving modelling challenges. We use the
Goal-Question-Metric [36] template to describe our research goals:
   Our first goal (G1) is to analyse the effect of using SEA_ML++ or DSML4MAS, for the
purpose of evaluation, with respect to the correctness with which a developer models a MAS
system, from the viewpoint of researchers, in the context of an experiment conducted with
graduate students from Universidade Nova de Lisboa and Ege University. Our second goal (G2)
is to analyse the effect of using SEA_ML++ or DSML4MAS, for the purpose of evaluation, with
respect to the speed with which a developer models a MAS system, from
the viewpoint of researchers, in the context of an experiment conducted with graduate
students from Universidade Nova de Lisboa and Ege University. Our third goal (G3) is to
analyse the effect of using SEA_ML++ or DSML4MAS, for the purpose of evaluation, with
respect to the rework involved in modelling a MAS system, from the viewpoint of
researchers, in the context of an experiment conducted with graduate students from
Universidade Nova de Lisboa and Ege University.

B. Experimental units
   The participants in this evaluation will be professional software developers from
Lisbon, and graduate students trained in several universities, namely Universidade Nova de
Lisboa, Instituto Superior Técnico and Instituto Universitário de Lisboa. We will have a
close replica of these evaluations with subjects from Ege University, in Turkey. We will
use convenience sampling to recruit participants at all these sites. Each participant will
be randomly assigned to one of four groups, keeping the four groups balanced in size.

C. Tasks
   Each subject will be asked to perform two modelling tasks: one using SEA_ML++, the other
using DSML4MAS. The two tasks will have similar complexity and will consist of modelling a
MAS system from a natural language description of that system. Participants will use an
Eclipse-based editor that is essentially the same for both languages: it varies only in the
language constructs and composition rules offered to participants, depending on which
language is being used. The participant will do their best to correctly model a system with
each of these languages. Regardless of the particular development task, the user will see a
split screen, with the majority of it being occupied by the editor, on the left side, and a
smaller portion with the case study the user is to model, on the right side. Figure 2
presents the starting point for performing the task with SEA_ML++. Both the textual
description of the model to build, on the right side, and the editor, on the left, are
sized so that the whole exercise can be performed without the need to resize or scroll any
window. Figure 3 presents the starting point for performing the task with DSML4MAS. Again,
window sizes will be similar, and no need to resize or scroll is expected. Indeed,
participants will be instructed not to change window sizes, to increase comparability among
sessions. After performing both modelling tasks, participants are asked to answer a System
Usability Scale (SUS) test on SEA_ML++ and DSML4MAS.
   The tasks involve three different viewpoints: the agent viewpoint, the MasAndOrg
viewpoint and the Interaction viewpoint. For the sake of illustration, we provide here the
agent viewpoint in both languages: Figure 4 (SEA_ML++) and Figure 5 (DSML4MAS). Further
materials, including large-sized versions of these diagrams, can be found in our companion
site.

D. Hypotheses, parameters and variables
   For each of our high-level goals, we define the null (H0) and alternative (H1)
hypotheses. Similar hypotheses can be written for contrasting SEA_ML++ with DSML4MAS in
terms of their effect on correctness, speed, amount of rework, visual effort, and perceived
usability of the languages.

   H0_Correctness: Using SEA_ML++ rather than DSML4MAS does not influence the correctness
of the produced models.
   H1_Correctness: Using SEA_ML++ rather than DSML4MAS influences the correctness of the
produced models.
   H0_Speed: Using SEA_ML++ rather than DSML4MAS does not influence the speed of model
production.
   H1_Speed: Using SEA_ML++ rather than DSML4MAS influences the speed of model production.
   H0_Rework: Using SEA_ML++ rather than DSML4MAS does not influence the amount of rework
during model production.
   H1_Rework: Using SEA_ML++ rather than DSML4MAS influences the amount of rework during
model production.
   H0_Effort: Using SEA_ML++ rather than DSML4MAS does not influence the visual effort
involved during model production.
   H1_Effort: Using SEA_ML++ rather than DSML4MAS influences the visual effort involved
during model production.
   H0_Usability: Using SEA_ML++ rather than DSML4MAS does not influence the perceived
usability of model production.
   H1_Usability: Using SEA_ML++ rather than DSML4MAS influences the perceived usability of
model production.

   1) Assessing correctness: For each of the proposed challenges, we have a “gold standard”
model defined in both languages, with which we can compare the models built by our
participants. The correctness of the proposed models is measured in terms of their
precision, recall, and F-measure, defined here as follows:
   • precision – the proportion of model elements and relationships in the model built by
     the participant that correctly address the challenge (even if the participant chose
     alternative ways of modelling the MAS when compared to the “gold standard”, as long as
     they are considered correct).
   • recall – the proportion of model elements and relationships in the “gold standard”
     model that are correctly addressed by the participant’s model.
   • F-measure – a measure that combines precision and recall, computed as
     $F = \frac{2 \cdot (Precision \cdot Recall)}{Precision + Recall}$; this measure is the
     harmonic mean of precision and recall.
   Higher values of precision, recall and the F-measure support the claim for higher
correctness, with 0 representing totally incorrect and 1 totally correct models.
   2) Assessing speed: We assess speed by measuring the amount of time (in seconds) taken
by our participant to build a MAS model. Lower values of this metric support the claim of
better language usage efficiency.
   3) Assessing rework: We assess rework by identifying, through the analysis of the model
building screencast, the moments where the participant discarded parts of the solution they
were building (e.g. by removing a previously added element or relationship).
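   The rework measure of Section III-D3 can be illustrated with a small sketch. It assumes
that the screencast has already been annotated by hand into a list of timestamped editing
events. The `EditEvent` record, the action names and the element identifiers are
hypothetical, introduced only for this illustration, and are not part of the study's
tooling. Counting only deletions of previously created elements is one possible reading of
"rework"; the study may adopt a broader one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EditEvent:
    timestamp: float  # seconds since the start of the session (hand-annotated)
    action: str       # assumed vocabulary: "create", "update" or "delete"
    element_id: str   # hypothetical label given to a model element or relationship


def rework_events(timeline):
    """Return the delete events that undo an earlier create, in time order."""
    alive = set()   # elements currently present in the participant's model
    rework = []
    for event in sorted(timeline, key=lambda e: e.timestamp):
        if event.action == "create":
            alive.add(event.element_id)
        elif event.action == "delete" and event.element_id in alive:
            alive.discard(event.element_id)
            rework.append(event)
    return rework


# Illustrative timeline: the participant adds two elements, then removes one.
timeline = [
    EditEvent(12.0, "create", "Agent:Buyer"),
    EditEvent(30.5, "create", "Role:Seller"),
    EditEvent(41.0, "delete", "Agent:Buyer"),
    EditEvent(55.0, "update", "Role:Seller"),
]
undone = rework_events(timeline)
```

   The number of returned events gives a per-session rework count, and their timestamps
can be lined up with the eye tracking data when looking for patterns around backtracking.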
                                    Fig. 2. Environment for performing the SEA_ML++ modelling task.


                                    Fig. 3. Environment for performing the DSML4MAS modelling task.
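
   The correctness metrics of Section III-D1 can be sketched as follows. This is a minimal
illustration, assuming that a human grader has already counted the relevant elements and
relationships; the function and parameter names are ours and hypothetical. Returning 0 when
a denominator is 0 is also an assumption of this sketch, not something the study design
specifies.

```python
def correctness_scores(correct_in_answer, total_in_answer,
                       gold_covered, total_in_gold):
    """Compute (precision, recall, F-measure) from counts made by a grader.

    correct_in_answer: elements/relationships in the participant's model
                       judged correct
    total_in_answer:   all elements/relationships in the participant's model
    gold_covered:      gold-standard elements/relationships correctly
                       addressed by the participant's model
    total_in_gold:     all elements/relationships in the gold standard
    """
    precision = correct_in_answer / total_in_answer if total_in_answer else 0.0
    recall = gold_covered / total_in_gold if total_in_gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f_measure


# Illustrative counts only: 8 of 10 submitted elements are correct,
# and they cover 8 of the 16 elements in the gold standard.
precision, recall, f_measure = correctness_scores(8, 10, 8, 16)
```

   Precision and recall use separate numerators because a participant may model part of the
system correctly in a way that differs from the gold standard.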


   4) Assessing visual effort: We assess visual effort using eye tracking data collected in
sync with the screencast. In particular, we will analyse heat maps of the screencasts to
compare, for example, whether there are significant differences in the amount of time spent
exploring modelling options available in the language toolbar and whether there is some
relationship between these exploring moments and patterns of rework.
   5) Assessing the perceived usability: We assess the perceived usability through a SUS
questionnaire, which provides a SUS score from 0 to 100, with an average value of 68 [7].
Higher values support the claim for better usability.

E. Design
   Table I outlines our cross-over within-subjects design, with two challenges from
different domains (D1 and D2), but with a similar complexity. Each participant will solve
those two challenges using a different language in each of them. To cancel learning
effects, we will balance the number of times the participants start with each of the
languages and each of the problems. In other words, we will balance the participants in
groups A, B, C and D.
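
   One way to obtain balanced groups A to D with random assignment is to shuffle the
participant list and deal it out round-robin. The sketch below is only an illustration
under that assumption: the study design does not prescribe a particular randomisation
procedure, and the participant identifiers and the `assign_groups` helper are hypothetical.
The task sequences mirror Table I.

```python
import random
from collections import Counter

# Task sequence per group, as in Table I: (domain, language) for each challenge.
GROUPS = {
    "A": [("D1", "SEA_ML++"), ("D2", "DSML4MAS")],
    "B": [("D1", "DSML4MAS"), ("D2", "SEA_ML++")],
    "C": [("D2", "SEA_ML++"), ("D1", "DSML4MAS")],
    "D": [("D2", "DSML4MAS"), ("D1", "SEA_ML++")],
}


def assign_groups(participants, seed=None):
    """Randomly assign participants to groups, with sizes differing by at most 1."""
    rng = random.Random(seed)      # a fixed seed makes the assignment reproducible
    shuffled = list(participants)
    rng.shuffle(shuffled)
    labels = sorted(GROUPS)
    return {p: labels[i % len(labels)] for i, p in enumerate(shuffled)}


# Hypothetical participant identifiers, for illustration only.
participants = [f"P{n:02d}" for n in range(1, 11)]
assignment = assign_groups(participants, seed=2018)
group_sizes = Counter(assignment.values())
```

   With 10 participants this yields two groups of three and two groups of two. Each site
(Portugal and Turkey) could run the same procedure separately to keep the groups balanced
per site.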
Fig. 4. SEA_ML++ agent viewpoint possible solution.


Fig. 5. DSML4MAS agent viewpoint possible solution
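
   For reference, the SUS score mentioned in Section III-D5 is conventionally computed from
ten items answered on a 1 to 5 scale: odd-numbered items contribute (response - 1),
even-numbered items contribute (5 - response), and the sum is multiplied by 2.5. The sketch
below follows that standard scoring rule; the function name and the input validation are
our own and are not taken from the study materials.

```python
def sus_score(responses):
    """Compute the 0-100 SUS score from ten responses on a 1-5 scale.

    responses[0] is item 1, responses[1] is item 2, and so on.
    """
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 responses")
    if any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("each response must be an integer from 1 to 5")
    total = 0
    for item, response in enumerate(responses, start=1):
        if item % 2 == 1:
            total += response - 1   # odd items are positively worded
        else:
            total += 5 - response   # even items are negatively worded
    return total * 2.5


neutral = sus_score([3] * 10)      # every item answered with the midpoint
best = sus_score([5, 1] * 5)       # most favourable answer on every item
worst = sus_score([1, 5] * 5)      # least favourable answer on every item
```

   Since each participant rates both languages, the two resulting scores per participant
can be compared as paired observations.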
                                TABLE I
                EXPERIMENTAL DESIGN AND TASKS SEQUENCE

Gr      Ltr   Dem     Tut    Cal     Challenge 1        Challenge 2        SUS
A       X     X       X      X       D1 / SEA_ML++      D2 / DSML4MAS      X
B       X     X       X      X       D1 / DSML4MAS      D2 / SEA_ML++      X
C       X     X       X      X       D2 / SEA_ML++      D1 / DSML4MAS      X
D       X     X       X      X       D2 / DSML4MAS      D1 / SEA_ML++      X

Gr: group; Ltr: letter of consent; Dem: demographic questionnaire; Tut: tutorial;
Cal: eye tracker calibration.

F. Procedure
   As depicted in Table I, before starting, each participant will sign a letter of consent,
adapted from [37], and fill in a demographic questionnaire, so that we record information
about our participants, including country, age, gender, academic level, previous experience
with MAS and, in particular, with each of the two analysed languages. This is followed by
viewing a short tutorial on both languages. Then, the subject will perform an eye tracking
device calibration, so that the eye tracking data of the session can be recorded with
precision. To maximise eye tracking recording precision, participants will be comfortably
seated at a distance of about 60 cm from a full HD 22 inch monitor and instructed not to
move much during the whole session. An EyeTribe eye tracker (http://www.theeyetribe.com/)
will be placed below the monitor. The participant will also have a keyboard and a mouse, to
be able to build a MAS model. After these preparatory tasks, the experiment itself can
start. During the whole session, a screencast of the contents of the screen will be
recorded. Furthermore, eye tracking data will also be collected, in sync with the
screencast of the session. The participant will have no time limit to finish their task,
but our pilot sessions point to a duration of about 20 minutes to perform the given tasks.
Finally, the subject answers a SUS test [7], so that we may contrast their opinions on the
usability of SEA_ML++ and of DSML4MAS.

G. Analysis procedure
   The data collected during the experiment sessions will be analysed using a combination
of automated data collection for the questionnaires and eye tracking data, with manual data
collection, combining the visual inspection of the screencast with the synchronised
recorded audio of the think-aloud protocol. Concerning descriptive statistics, we will
normally collect the following ones, adjusting the actual set of descriptive statistics to
the scale type (nominal, ordinal, interval or ratio) of each variable: number of cases,
mean, median, mode, standard deviation, skewness, kurtosis, and the p-value of the
Shapiro-Wilk normality test. We will then use appropriate statistical tests. For example,
we plan to use the Welch t-test (which is a more robust alternative to the t-test [38]) to
compare the distributions of correctness obtained with SEA_ML++ vs. DSML4MAS. The
statistical analysis will be run using SPSS
(https://www.ibm.com/analytics/spss-statistics-software).
   1) Correctness: The data concerning correctness will be collected through visual
inspection of the solutions created by the participants in our study. This implies a
qualitative assessment of those solutions in a process which is somewhat similar to grading
the result of a modelling exercise in an academic context, following the criteria detailed
in Section III-D1. We will then compute descriptive statistics for the collected metrics
and test for significant differences between the level of correctness achieved with each
language.
   2) Speed: The data concerning speed will be collected during the visual inspection of
the screencast of the sessions, by annotating the timestamps marking the beginning and the
end of each task. We will then compute descriptive statistics for the collected metrics and
test for significant differences between the duration of the tasks using each language.
   3) Rework: The data concerning rework will be collected through visual inspection of the
screencast. In particular, we will collect and annotate with timestamps events of creation,
deletion, or update of model elements and associations among those elements. This will
provide us with a timeline of the model construction process for further analysis.
Concerning rework, we will analyse activities that undo previous work (e.g. a model element
that was previously added to the model and now is deleted). This will allow identifying
when the participant is convinced they made a mistake and decides to backtrack. Ultimately,
we will explore whether the different languages lead to different levels of rework, both in
general and within particular sub-groups of participants, divided according to their
background (e.g. by level of expertise with MAS).
   4) Visual effort: The eye tracking data is collected automatically during the execution
of the experiment. This produces a time series of eye tracking events, namely fixations and
saccades, with their duration, location, etc. The screen area will be annotated with
relevant areas of interest, so that we can use the eye tracking data to monitor how each
participant navigated through those areas during the process. We will use custom-made tools
from the NOVA LINCS team to support this analysis. In the end, we expect to use heat maps
to analyse where the most important focuses of visual attention were, and scanpath analysis
to better understand the model navigation strategies of our participants.
   5) Perceived usability: We will assess usability through a SUS test. The SUS instrument
is available in the testing environment as a web form. The collected data will be directly
fed into SPSS so that we may proceed with the comparative analysis of the distributions of
the usability scores.

                      IV. DISCUSSION

A. Expected results and implications
   We are interested in assessing how the usability is influenced by the selection of one
of these languages over the other. Rather than using these results as a way of promoting
the usage of one of the languages, our goal is to identify language improvement
opportunities, on the one hand, and to learn from the “competition”, on the other. This
process is, in that sense, similar to the one the NOVA LINCS team has followed for
supporting the Gyro language evolution [35] through a series of developer experience
evaluations. We have advocated elsewhere [1] that software language development should
be iterative and incremental, including (possibly lightweight)            3) External validity: Our participants will not have, in gen-
evaluations after each iteration, so that improvement oppor-           eral, much experience with MAS and with the two languages.
tunities are identified as soon as possible, and, when feasible        As such, our participants are better representatives of devel-
and adequate, followed on in the next version of the language.         opers who are learning these languages. Further research is
   Apart from the more “traditional” analysis of effectiveness, regarded here
as correctness, and of efficiency, regarded here as speed, we expect our
exploratory study of the model building process, which analyses the
time-annotated sequences of insertions, deletions and changes made while
constructing models, to provide insights into the main bottlenecks language
users experience during model building and, conversely, into where they seem
to experience fewer difficulties. The eye tracking data is expected to provide
further context for better identifying language improvement opportunities. In
the longer run, the lessons learned in this and similar studies have the
potential to help us design more usable software modelling languages. They
will also help us better understand how people from different backgrounds
interact with each modelling language, building on earlier work that explored
how personal characteristics (e.g. gender) affect learning, problem solving
and information processing styles [39]. Finally, the SUS usability
questionnaire will help us better understand how the differences between the
two languages impact usability.

B. Threats to validity
   1) Conclusion validity: Although we plan to have a reasonable number of
participants (over 30), considering the nature of this study, sample size is a
likely threat, due to the difficulty in recruiting participants. Our
mitigation strategy is to have two teams performing the study in two different
countries. The exercise of preparing the experimental replication package so
that it can be run in both Portugal and Turkey will help us fine-tune it,
making the package more reusable for third-party replications. This will
mitigate the sample size risk directly, as we will have participants in both
countries, and indirectly, by facilitating potential third-party replications.
   2) Internal validity: There is a potential learning effect from solving one
challenge to the next. We mitigate this risk with a crossover design, so that
half of the participants start with a SEA_ML++ model while the other half
start with DSML4MAS. Another threat is that a particular problem could
accidentally favour one of the languages. To mitigate it, both problems will
be modelled in both languages, by different participants. We chose two
languages for which the tool support is at a similar level, and with a similar
look and feel, so that tooling does not play a role in differentiating between
the two DSLs. We also made efforts so that all materials are easily readable
on a 22 inch monitor and that the models to be developed fit in a canvas on
this kind of monitor, without requiring the user to scroll or zoom the image.
Monitor size and the general layout for the experiment, including the distance
of the participant to the monitor, were constrained by the technical
specifications of our eye tracking device. Even within these constraints, the
tasks are already challenging for our participants.
necessary to assess how these languages compare, when used by modellers who
are experienced with the two languages. The conclusions of this study will be
applicable to these two MAS DSMLs. Replications with other languages, not
necessarily for MAS, are required before we can generalise this study’s
conclusions to other contexts.
   4) Construct validity: After watching a short tutorial about both
languages, participants will solve a couple of challenges, one with each
language. This may cause an evaluation apprehension threat. We mitigate this
by informing participants that the languages are being evaluated, not the
participants. The experimental process is built so that we express no bias
toward either language, to mitigate the risk of accidentally favouring
SEA_ML++. Our goal is to identify opportunities to improve SEA_ML++, rather
than to compare it with DSML4MAS for its own sake. Our measures to mitigate
this risk include choosing, as the author of the recorded tutorials, someone
with no vested interest in either language, and doing the same for the
researchers performing the data analysis. Further, in the interest of
transparency and replicability, the data used in these evaluations and the
data analysis scripts for SPSS will be made publicly available. Last but not
least, this paper, which discusses the experimental design to be used in this
evaluation, serves as a statement of our intent to perform this particular
experiment. This creates an opportunity for a sanity check, where the initial
goals of this study can be directly compared with what is actually tested in
the experiment and reported later. This mitigates the potential for selective
publishing, where only favourable results would be published.

                                 V. SUMMARY

   We presented the experimental plan for evaluating a case study of DSLs for
Multi-Agent Systems. Our goal is to go beyond evaluating the languages’
notations (concrete syntax) and to evaluate the composition of constructs at
the instance (sentence) level (abstract syntax).
   We expect the results of the evaluation planned in this paper to help
identify effective opportunities for improving the developer experience with
SEA_ML++.
   This work opens a path for future research, as it moves from the more
commonly explored aspect of visual modelling languages (their visual notation)
to other relevant perspectives, namely the instance (sentence) level.

                               ACKNOWLEDGMENT

   The authors would like to thank the following: i) the Scientific and
Technological Research Council of Turkey (TUBITAK) under grant 115E591, and
ii) Portuguese grants NOVA LINCS Research Laboratory (Grant: FCT/MCTES PEst
UID/CEC/04516/2013) and DSML4MAS Project (Grant: FCT/MCTES
TUBITAK/0008/2014).
   The authors would also like to thank the COST Action networking mechanisms
and support of IC1404 Multi-Paradigm Modeling for Cyber-Physical Systems
(MPM4CPS). COST is supported by the EU Framework Programme Horizon 2020.

                                 REFERENCES

 [1] A. Barišić, V. Amaral, and M. Goulão, “Usability driven DSL development
     with USE-ME,” Computer Languages, Systems & Structures, vol. 51,
     pp. 118–157, 2018.
 [2] M. Mernik, J. Heering, and A. M. Sloane, “When and how to develop
     domain-specific languages,” ACM Comput. Surv., vol. 37, no. 4,
     pp. 316–344, 2005.
 [3] D. Moody, “The “physics” of notations: toward a scientific basis for
     constructing visual notations in software engineering,” IEEE T Software
     Eng, vol. 35, no. 6, pp. 756–779, 2009.
 [4] M. Challenger, S. Demirkol, S. Getir, M. Mernik, G. Kardas, and
     T. Kosar, “On the use of a domain-specific modeling language in the
     development of multiagent systems,” Eng Appl Artif Intel, vol. 28,
     pp. 111–141, 2014.
 [5] T. Miranda, M. Challenger, B. T. Tezel, O. F. Alaca, V. Amaral,
     M. Goulão, and G. Kardas, “Improving the usability of a MAS DSML,” in
     6th International Workshop on Engineering Multi-Agent Systems (EMAS
     2018). Stockholm, Sweden: Springer, July 14, 2018.
 [6] C. Hahn, “A domain specific modeling language for multiagent systems,”
     in Proceedings of the 7th international joint conference on Autonomous
     agents and multiagent systems-Volume 1, 2008, pp. 233–240.
 [7] J. Brooke et al., “SUS - a quick and dirty usability scale,” Usability
     evaluation in industry, vol. 189, no. 194, pp. 4–7, 1996.
 [8] G. Kardas and J. J. Gomez-Sanz, “Special issue on model-driven
     engineering of multi-agent systems in theory and practice,” Comput Lang
     Syst Str, vol. 50, pp. 140–141, 2017.
 [9] G. Beydoun, G. Low, B. Henderson-Sellers, H. Mouratidis, J. J.
     Gomez-Sanz, J. Pavon, and C. Gonzalez-Perez, “FAML: a generic metamodel
     for MAS development,” IEEE T Software Eng, vol. 35, no. 6, pp. 841–863,
     2009.
[10] G. Ciobanu and C. Juravle, “Flexible software architecture and language
     for mobile agents,” Concurr Comp-Pract E, vol. 24, no. 6, pp. 559–571,
     2012.
[11] J. M. Gascueña, E. Navarro, and A. Fernández-Caballero, “Model-driven
     engineering techniques for the development of multi-agent systems,” Eng
     Appl Artif Intel, vol. 25, no. 1, pp. 159–173, 2012.
[12] E. J. T. Gonçalves, M. I. Cortés, G. A. L. Campos, Y. S. Lopes, E. S.
     Freire, V. T. da Silva, K. S. F. de Oliveira, and M. A. de Oliveira,
     “MAS-ML 2.0: Supporting the modelling of multi-agent systems with
     different agent architectures,” J Syst Software, vol. 108, pp. 77–109,
     2015.
[13] S. Demirkol, M. Challenger, S. Getir, T. Kosar, G. Kardas, and
     M. Mernik, “A DSL for the development of software agents working within
     a semantic web environment,” Computer Science and Information Systems,
     vol. 10, no. 4, pp. 1525–1556, 2013.
[14] G. Kardas, B. T. Tezel, and M. Challenger, “Domain-specific modelling
     language for belief-desire-intention software agents,” IET Softw,
     vol. 12, no. 4, pp. 356–364, 2018.
[15] J. Faccin and I. Nunes, “A tool-supported development method for
     improved BDI plan selection,” Engineering Applications of Artificial
     Intelligence, vol. 62, pp. 195–213, 2017.
[16] D. Moody and J. van Hillegersberg, “Evaluating the visual syntax of UML:
     An analysis of the cognitive effectiveness of the UML family of
     diagrams,” in International Conference on Software Language Engineering.
     Springer, 2008, pp. 16–34.
[17] D. L. Moody, P. Heymans, and R. Matulevičius, “Visual syntax does
     matter: improving the cognitive effectiveness of the i* visual
     notation,” Requir Eng, vol. 15, no. 2, pp. 141–175, 2010.
[18] M. Challenger, G. Kardas, and B. Tekinerdogan, “A systematic approach to
     evaluating domain-specific modeling language environments for
     multi-agent systems,” Software Qual J, vol. 24, no. 3, pp. 755–795,
     Sep. 2016.
[19] G. Kardas, E. Bircan, and M. Challenger, “Supporting the platform
     extensibility for the model-driven development of agent systems by the
     interoperability between domain-specific modeling languages of
     multi-agent systems,” Comput Sci Inf Syst, vol. 14, no. 3, pp. 875–912,
     2017.
[20] F. Bergenti, E. Iotti, S. Monica, and A. Poggi, “Agent-oriented
     model-driven development for JADE with the JADEL programming language,”
     Comput Lang Syst Str, vol. 50, pp. 142–158, 2017.
[21] A. El Kouhen, A. Gherbi, C. Dumoulin, and F. Khendek, “On the semantic
     transparency of visual notations: Experiments with UML,” in
     International SDL Forum. Springer, 2015, pp. 122–137.
[22] N. Genon, P. Heymans, and D. Amyot, “Analysing the cognitive
     effectiveness of the BPMN 2.0 visual notation,” in Proceedings of the
     Third International Conference on Software Language Engineering, 2010,
     pp. 377–396.
[23] D. L. Moody, “Why a diagram is only sometimes worth a thousand words: An
     analysis of the BPMN 2.0 visual notation,” Tech. Rep., 2011. Retrieved
     2012-06-19 from
     http://www.business.uq.edu.au/sites/default/files/event/supportingDocs/Analysis%20of%20BPMN%202.0%20Visual%20Syntax.pdf
[24] R. Matulevičius and P. Heymans, “Visually effective goal models using
     KAOS,” in International Conference on Conceptual Modeling. Springer,
     2007, pp. 265–275.
[25] R. Matulevičius and P. Heymans, “Comparing goal modelling languages: An
     experiment,” in International Working Conference on Requirements
     Engineering: Foundation for Software Quality, 2007, pp. 18–32.
[26] M. Santos, C. Gralha, M. Goulão, and J. Araújo, “Increasing the semantic
     transparency of the KAOS goal model concrete syntax,” in 37th
     International Conference on Conceptual Modeling (ER 2018). Xi’an, China:
     Springer, October 22–25, 2018.
[27] P. Caire, N. Genon, P. Heymans, and D. L. Moody, “Visual notation design
     2.0: Towards user comprehensible requirements engineering notations,” in
     RE’13. IEEE, 2013, pp. 115–124.
[28] N. Genon, P. Caire, H. Toussaint, P. Heymans, and D. Moody, “Towards a
     more semantically transparent i* visual syntax,” in International
     Working Conference on Requirements Engineering: Foundation for Software
     Quality, 2012, pp. 140–146.
[29] M. Santos, C. Gralha, M. Goulão, J. Araújo, and A. Moreira, “On the
     impact of semantic transparency on understanding and reviewing social
     goal models,” in 26th IEEE International Conference on Requirements
     Engineering (RE 2018). Banff, Canada: IEEE, August 20–24, 2018.
[30] H. Henriques, H. Lourenço, V. Amaral, and M. Goulão, “Improving the
     developer experience with a low-code process modelling language,” in
     ACM/IEEE 21st International Conference on Model Driven Engineering
     Languages and Systems (MODELS). Copenhagen, Denmark: ACM, October 2018.
[31] H. Störrle, “On the impact of layout quality to understanding UML
     diagrams,” in Visual Languages and Human-Centric Computing (VL/HCC),
     2011 IEEE Symposium on. IEEE, 2011, pp. 135–142.
[32] H. Störrle, “On the impact of layout quality to understanding UML
     diagrams: Diagram type and expertise,” in Visual Languages and
     Human-Centric Computing (VL/HCC), 2012 IEEE Symposium on. IEEE, 2012,
     pp. 49–56.
[33] H. Störrle, “On the impact of layout quality to understanding UML
     diagrams: size matters,” in International Conference on Model Driven
     Engineering Languages and Systems. Springer, 2014, pp. 518–534.
[34] M. Santos, C. Gralha, M. Goulão, J. Araújo, A. Moreira, and
     J. Cambeiro, “What is the impact of bad layout in the understandability
     of social goal models?” in 24th IEEE International Requirements
     Engineering Conference (RE’16). Beijing, China: IEEE, September 12–16,
     2016.
[35] A. Barišić, J. Cambeiro, V. Amaral, M. Goulão, and T. Mota, “Leveraging
     teenagers feedback in the development of a domain-specific language: the
     case of programming low-cost robots,” in Proceedings of the 33rd Annual
     ACM Symposium on Applied Computing. ACM, 2018, pp. 1221–1229.
[36] V. Basili, G. Caldiera, and H. Rombach, “Goal Question Metric Paradigm,”
     Encyclopedia of Software Eng., vol. 1, pp. 528–532, 2001.
[37] P. Runeson, M. Höst, A. Rainer, and B. Regnell, Case study research in
     software engineering: Guidelines and examples. Wiley, 2012.
[38] B. L. Welch, “The generalization of ‘student’s’ problem when several
     different population variances are involved,” Biometrika, vol. 34,
     no. 1-2, pp. 28–35, 1947. [Online]. Available:
     http://dx.doi.org/10.1093/biomet/34.1-2.28
[39] L. Beckwith and M. Burnett, “Gender: An important factor in end-user
     programming environments?” in Visual Languages and Human Centric
     Computing, 2004 IEEE Symposium on. IEEE, 2004, pp. 107–114.