Comparing the Usability of two Multi-Agents Systems DSLs: SEA_ML++ and DSML4MAS
Study Design

João Silva∗, Ankica Barišić∗, Vasco Amaral∗, Miguel Goulão∗, Baris Tekin Tezel†, Omer Faruk Alaca‡, Moharram Challenger‡, and Geylani Kardas‡
∗ Universidade NOVA de Lisboa, NOVA LINCS, DI, FCT, Lisboa, Portugal
† Dokuz Eylul University, Izmir, Turkey
‡ Ege University, International Computer Institute, Izmir, Turkey
Email: (ji.silva | a.barisic)@campus.fct.unl.pt, (vma | mgoul)@fct.unl.pt, baris.tezel@deu.edu.tr, omerfarukalaca@gmail.com, (moharram.challenger | geylani.kardas)@ege.edu.tr

Abstract—Context: The “Physics of Notations” (PoN) supports a systematic improvement of the cognitive effectiveness of visual modelling languages. Problem: PoN focuses on the concrete syntax of a language, building on a predefined abstract syntax. We should also consider the abstract syntax of a language when developing efforts to improve it, by choosing the most adequate language constructs (concepts and their relationships). We instantiate this challenge by comparing two Multi-Agent Systems Domain Specific Languages, SEA_ML++ and DSML4MAS, and assessing the extent to which their respective constructs affect the developer experience. Method: We will perform a quasi-experiment for comparing how practitioners use both languages to solve similar modelling challenges. The experiment will have a cross-over within-subjects design and will focus on the extent to which the different language constructs impact on developer experience. These tasks will be monitored, so that we can assess their success and the effort involved, including eye-tracking information. Results: This paper reports on the planned study design for this empirical comparison of two DSLs for MAS.

I. INTRODUCTION

In the last two decades, technologies like modelling workbenches have made it easier to design, prototype and deploy diagrammatic languages, which are used more often for capturing abstractions in modelling. Extensive experience with the development of domain-specific languages (DSLs) led to a new discipline, Software Languages Engineering (SLE), with the goal of making the process of developing a software language systematic. SLE follows an iterative life-cycle [1], [2] that starts with domain analysis, followed by language design, implementation and evaluation. Unfortunately, the first and the last steps are still not at a mature phase. Besides taking into account the evaluation of expressiveness of a given language, the language design (coverage of the language goals) needs to make use of empirical studies to assess the language usability.

The “Physics of Notations” (PoN) [3] created a valuable framework to evaluate the language’s concrete syntax, and is extensively used to support a systematic improvement of the cognitive effectiveness of visual modelling languages with a fixed abstract syntax in some language metamodel or grammar.

Improving the concrete syntax is very important, but we should also consider the abstract syntax of a language. We should be able to choose and validate the adequate language constructs (concepts and their relationships) and the models (or language sentences) we can express with those. However, there is a lack of guidelines reported in the literature for this new level of assessment that could give a language engineer a “recipe” for doing this sort of evaluation. These would be valuable when developing and improving a given language.

We are currently improving a Multi-Agent Systems (MAS) DSL: SEA_ML++ [4], [5]. In this context, we are planning an empirical comparison with another MAS DSL, DSML4MAS [6], to assess the extent to which these DSLs' respective constructs, and their combinations, affect language usability.

We will perform a quasi-experiment for comparing how practitioners use both languages to solve similar modelling challenges. We will have a cross-over within-subjects design with a focus on how the different language constructs impact on their usability by modellers. We will monitor these tasks to assess their success and the effort involved, including eye-tracking information and a usability questionnaire (SUS [7]).

This paper is organised as follows: Section II describes DSMLs for MAS languages as the object of our case study. Section III presents the planned quasi-experiment, followed by a discussion in Section IV. Section V summarises this paper.

II. BACKGROUND

A. Multi-Agent Systems DSMLs

Software agents are autonomous entities which contain intelligence that serves for solving their selfish or common problems and to achieve certain goals. The study of Multi-Agent Systems (MASs) focuses on those systems in which many intelligent agents interact with each other. In agent-oriented software engineering (AOSE), the application of model-driven development (MDD) and the use of domain-specific modelling languages (DSMLs) for MAS development are quite popular, since the implementation of MAS is naturally complex, error-prone and costly due to the autonomous and proactive properties of the agents [8].

Fig. 1. Dimensions in grey explored by the current work. Adapted from [17].

In the last decade, several MAS modelling languages and DSMLs (e.g. [4], [6], [9], [10], [11], [12]) were proposed to support the development of MASs. For example, DSML4MAS [6] introduces a general MAS metamodel with various viewpoints that enable the development of MAS for many application domains. A DSL is introduced in [10] to provide a language for the development of mobile agents. In addition, [11] introduces a modelling language enabling model-driven development within the scope of the Prometheus methodology for agent development. In [4] and [13], a graphical DSML (called SEA_ML) and a textual DSL (called SEA_L) are proposed for MAS working in semantic web environments, including 8 viewpoints. MAS-ML 2.0 [12] is a modelling language which
supports the MAS modelling with different agent architectures such as: Simple Reflex Agents, Model-Based Reflex Agents, Goal-Based Agents and Utility-Based Agents. DSML4BDI [14] is another modelling language, specific to the Jason agent programming language. In [15], the authors propose a tool-supported development method that applies MDD techniques to design and implement agents based on the belief-desire-intention architecture with a sophisticated plan selection process.

B. The “Physics of Notations” (PoN)

Moody proposed the “Physics of Notations” [3] to support the construction of more effective software languages. A major concern is how to evaluate the cognitive effectiveness of visual languages (see, for example, [16]). The framework concentrates on the physical properties (concrete syntax) of the symbols and not on their structure (abstract syntax) or semantics (ignoring the semantics of both the ontological and the language target semantic domain). In Figure 1, we present the dimensions at the instance level (in grey) that are explored by the current work. Here we study the composition of visual elements and its structure to form sentence instances. This figure is adapted from [17], where the authors refer to their focus on the top left corner (Visual Notations).

C. Related studies

Most of the available DS(M)Ls proposed for MASs have been evaluated by just providing a case study demonstrating how the related language can be used for the design and implementation of MAS. A quantitative analysis and/or qualitative evaluation considering e.g. the development time performance, generation performance, and/or the usability of the language is not considered in these studies.

In [18], we proposed an evaluation framework which provides the systematic assessment of both the language constructs and the use of agent DSMLs according to various dimensions and criteria. The study also provides an assessment of SEA_ML [4]. However, it does not take into account the effect of language constructs on the developer’s modelling process while using the languages. This evaluation framework is adopted in [14], [19] and [15] for the assessment of the proposed MAS DSMLs. Another MAS DSML evaluation exists in [20], for a textual DSL called JADEL, which provides four abstractions, namely agents, behaviours, communication ontologies, and interaction protocols, to the well-known JADE agent development framework. However, the study evaluates solely JADEL’s code generation performance.

In recent years, we have seen several studies aiming to identify language improvement opportunities, identifying problems with the concrete syntax of languages and how they impact developer experience. These studies have covered a diversity of languages, including UML [16], [21], BPMN [22], [23], KAOS [24], [25], [26], i* [17], [27], [28], [29], OutSystems BPT [30], and SEA_ML++ [5]. Some of these languages were also analysed from the perspective of the impact of diagram layout on the understandability of models, namely UML [31], [32], [33] and i* [34]. Other studies have compared alternative DSLs for a similar domain (e.g. Lego Mindstorms vs. Gyro [35]).

III. EXPERIMENT PLANNING

This section describes the experimental planning for this evaluation. Further details, including documentation and evaluation materials, can be found in this paper’s companion site¹.

¹ https://sites.google.com/fct.unl.pt/hufamo2018masstudydesign/home

A. Goals

Broadly, we are interested in assessing the usability of two MAS DSMLs, SEA_ML++ [4], [5] and DSML4MAS [6], in the context of solving modelling challenges. We use the Goal-Question-Metric [36] template to describe our research goals:

Our first goal (G1) is to analyse the effect of using SEA_ML++ or DSML4MAS, for the purpose of evaluation, with respect to the correctness with which a developer models a MAS system, from the viewpoint of researchers, in the context of an experiment conducted with graduate students from Universidade Nova de Lisboa and Ege University. Our second goal (G2) is to analyse the effect of using SEA_ML++ or DSML4MAS, for the purpose of evaluation, with respect to the speed with which a developer models a MAS system, from the viewpoint of researchers, in the context of an experiment conducted with graduate students from Universidade Nova de Lisboa and Ege University. Our third goal (G3) is to analyse the effect of using SEA_ML++ or DSML4MAS, for the purpose of evaluation, with respect to the rework involved in modelling a MAS system, from the viewpoint of researchers, in the context of an experiment conducted with graduate students from Universidade Nova de Lisboa and Ege University.

B. Experimental units

The participants in this evaluation will be professional software developers from Lisbon, and graduate students trained in several universities, namely Universidade Nova de Lisboa, Instituto Superior Técnico and Instituto Universitário de Lisboa. We will have a close replica of these evaluations with subjects from the Ege University, in Turkey. We will use convenience sampling to recruit participants at all these sites. Each participant will be randomly assigned to one of four groups, keeping a balanced sample in each of the four groups.

C. Tasks

Each subject will be asked to perform two modelling tasks: one using SEA_ML++, the other using DSML4MAS. The two tasks will have similar complexity and will consist in modelling a MAS system from a natural language description of that system. Participants will use Eclipse-based editors, which are essentially similar for both languages. The editor only varies in the language constructs and composition rules offered to participants, depending on which language is being used. The participant will do his best to correctly model a system with each of these languages. Regardless of the particular development task, the user will see a split screen, with the majority of it being occupied by the editor, on the left side, and a smaller portion with the case study the user is to model, on the right side. Figure 2 presents the starting point for performing the task with SEA_ML++. Both the textual description of the model to build, on the right side, and the editor, on the left, are sized so that the whole exercise can be performed without the need to resize or scroll any window. Figure 3 presents the starting point for performing the task with DSML4MAS. Again, window sizes will be similar, and no need to resize or scroll is expected. Indeed, participants will be instructed not to change window sizes, to increase comparability among sessions. After performing both modelling tasks, participants are asked to answer a System Usability Scale (SUS) test on SEA_ML++ and DSML4MAS.

Fig. 2. Environment for performing the SEA_ML++ modelling task.

Fig. 3. Environment for performing the DSML4MAS modelling task.

The tasks involve three different viewpoints: the agent viewpoint, the MasAndOrg viewpoint and the Interaction viewpoint. For the sake of illustration, we provide here the agent viewpoint in both languages, in Figure 4 (SEA_ML++) and Figure 5 (DSML4MAS). Further materials, including large-sized versions of these diagrams, can be found in our companion site.

D. Hypotheses, parameters and variables

For each of our high-level goals, we define the null (H0) and alternative (H1) hypotheses. Similar hypotheses can be written for contrasting SEA_ML++ with DSML4MAS in terms of their effect on correctness, speed, amount of rework, visual effort, and perceived usability of the languages.

H0_Correctness: Using SEA_ML++ rather than DSML4MAS does not influence the correctness of the produced models.
H1_Correctness: Using SEA_ML++ rather than DSML4MAS influences the correctness of the produced models.
H0_Speed: Using SEA_ML++ rather than DSML4MAS does not influence the speed of model production.
H1_Speed: Using SEA_ML++ rather than DSML4MAS influences the speed of model production.
H0_Rework: Using SEA_ML++ rather than DSML4MAS does not influence the amount of rework during model production.
H1_Rework: Using SEA_ML++ rather than DSML4MAS influences the amount of rework during model production.
H0_Effort: Using SEA_ML++ rather than DSML4MAS does not influence the visual effort involved during model production.
H1_Effort: Using SEA_ML++ rather than DSML4MAS influences the visual effort involved during model production.
H0_Usability: Using SEA_ML++ rather than DSML4MAS does not influence the perceived usability of model production.
H1_Usability: Using SEA_ML++ rather than DSML4MAS influences the perceived usability of model production.

1) Assessing correctness: For each of the proposed challenges, we have a “gold standard” model defined in both languages, with which we can compare the models built by our participants. The correctness of the proposed models is measured in terms of their precision, recall, and F-measure, defined here as follows:

• precision – the percentage of model elements and relationships in the model built by the participant that correctly address the challenge (even if the participant chose alternative ways of modelling the MAS when compared to the “gold standard”, as long as they are considered correct).
• recall – the percentage of model elements and relationships in the “gold standard” model that are correctly addressed by the participant’s model.
• F-measure – a measure that combines precision and recall, computed as $\frac{2 \times (Precision \times Recall)}{Precision + Recall}$; this measure is the harmonic mean of precision and recall.

Higher values of precision, recall and the F-measure support the claim for higher correctness, with 0 representing totally incorrect and 1 totally correct models.

2) Assessing speed: We assess speed by measuring the amount of time (measured in seconds) taken by our participant to build a MAS model. Lower values of this metric support the claim of better language usage efficiency.

3) Assessing rework: We assess rework by identifying, through the analysis of the model building screencast, the moments where the participant discarded parts of the solution he was building (e.g. by removing a previously added element or relationship).

4) Assessing visual effort: We assess visual effort using eye tracking data collected through the screencast. In particular, we will analyse heat-maps of the screencasts to compare, for example, whether there are significant differences in the amount of time spent exploring modelling options available in the language toolbar and whether there is some relationship between these exploring moments and patterns of rework.

5) Assessing the perceived usability: We assess the perceived usability through a SUS questionnaire, which provides a SUS score from 0 to 100, with an average value of 68 [7]. Higher values support the claim for a better usability.

E. Design

Table I outlines our cross-over within-subjects design, with two challenges from different domains (D1 and D2), but with a similar complexity. Each participant will solve those two challenges using a different language in each of them. To cancel learning effects, we will balance the number of times the participants start with each of the languages and each of the problems. In other words, we will balance the participants in groups A, B, C and D.

Fig. 4. SEA_ML++ agent viewpoint possible solution.

Fig. 5.
DSML4MAS agent viewpoint possible solution.

TABLE I
EXPERIMENTAL DESIGN AND TASKS SEQUENCE

Gr   Ltr  Dem  Tut  Cal  Challenge 1       Challenge 2       SUS
A    X    X    X    X    D1 / SEA_ML++     D2 / DSML4MAS     X
B    X    X    X    X    D1 / DSML4MAS     D2 / SEA_ML++     X
C    X    X    X    X    D2 / SEA_ML++     D1 / DSML4MAS     X
D    X    X    X    X    D2 / DSML4MAS     D1 / SEA_ML++     X

(Gr: group; Ltr: letter of consent; Dem: demographic questionnaire; Tut: tutorial; Cal: eye tracker calibration; SUS: SUS questionnaire.)

F. Procedure

As depicted in Table I, before starting, each participant will sign a letter of consent, adapted from [37], and fill in a demographic questionnaire, so that we record information about our participants, including country, age, gender, academic level, previous experience with MAS and, in particular, with each of the two analysed languages. This is followed by viewing a short tutorial on both languages. Then, the subject will perform an eye tracking device calibration, so that the eye tracking data of the session can be recorded with precision. To maximise eye tracking recording precision, participants will be comfortably seated at a distance of about 60 cm from a full HD 22-inch monitor and instructed not to move much during the whole session. An EyeTribe eye tracker² will be placed below the monitor. The participant will also have a keyboard and a mouse, to be able to build a MAS model.

² http://www.theeyetribe.com/

After these preparatory tasks, the experiment itself can start. During the whole session, a screencast of the contents of the screen will be recorded. Furthermore, eye tracking data will also be collected, in sync with the screencast of the session. The participant will have no time limit to finish his task, but our pilot sessions point to a duration of about 20 minutes to perform the given tasks. Finally, the subject answers a SUS test [7], so that we may contrast his opinions on the usability of SEA_ML++ and of DSML4MAS.

G. Analysis procedure

The data collected during the experiment sessions will be analysed using a combination of automated data collection, for the questionnaires and eye tracking data, with manual data collection, combining the visual inspection of the screencast with the synchronised recorded audio of the think aloud protocol. Concerning descriptive statistics, we will normally collect the following ones, adjusting the actual set of descriptive statistics to the scale type (nominal, ordinal, interval or ratio) of each variable: number of cases, mean, median, mode, standard deviation, skewness, kurtosis, and the p-value of the Shapiro-Wilk normality test. We will then use appropriate statistical tests. For example, we plan to use the Welch t test (which is a more robust alternative to the t-test [38]) to compare the distributions of correctness obtained with SEA_ML++ vs. DSML4MAS. The statistical analysis will be run using SPSS³.

³ https://www.ibm.com/analytics/spss-statistics-software

1) Correctness: The data concerning correctness will be collected through visual inspection of the solutions created by the participants in our study. This implies a qualitative assessment of those solutions in a process which is somewhat similar to grading the result of a modelling exercise, in an academic context, following the criteria detailed in Section III-D1. We will then compute descriptive statistics for the collected metrics and test for significant differences between the levels of correctness achieved with each language.

2) Speed: The data concerning speed will be collected during the visual inspection of the screencast of the sessions, by annotating the timestamps marking the beginning and the end of each task. We will then compute descriptive statistics for the collected metrics and test for significant differences between the duration of the tasks using each language.

3) Rework: The data concerning rework will be collected through visual inspection of the screencast. In particular, we will collect and annotate with timestamps events of creation, deletion, or update of model elements and associations among those elements. This will provide us with a timeline of the model construction process for further analysis. Concerning rework, we will analyse activities that undo previous work (e.g. a model element that was previously added to the model and now is deleted). This will allow identifying when the participant is convinced he made a mistake and decides to backtrack. Ultimately, we will explore whether the different languages lead to different levels of rework, both in general, and with particular sub-groups of participants, divided according to their background (e.g. by level of expertise with MAS).

4) Visual effort: The eye tracking data is collected automatically during the execution of the experiment. This produces a time series of eye tracking events, namely fixations and saccades, with their duration, location, etc. The screen area will be annotated with relevant areas of interest, so that we can use the eye tracking data to monitor how each participant navigated through those areas during the process. We will use custom-made tools from the NOVA LINCS team to support this analysis. In the end, we expect to use heat maps to analyse where the most important focuses of visual attention were, and scanpath analysis to better understand the model navigation strategies of our participants.

5) Perceived usability: We will assess usability through a SUS test. The SUS instrument is available in the testing environment as a web form. The collected data will be directly fed into SPSS so that we may proceed with the comparative analysis of the distributions of the usability scores.

IV. DISCUSSION

A. Expected results and implications

We are interested in assessing how the usability is influenced by the selection of one of these languages over the other. Rather than using these results as a way of promoting the usage of one of the languages, our goal is to identify language improvement opportunities, on the one hand, and to learn from the “competition”, on the other. This process is, in that sense, similar to the one the NOVA LINCS team has followed for supporting the Gyro language evolution [35] through a series of developer experience evaluations. We have advocated elsewhere [1] that software language development should be iterative and incremental, including (possibly lightweight) evaluations after each iteration, so that improvement opportunities are identified as soon as possible and, when feasible and adequate, followed up in the next version of the language.

Apart from the more “traditional” analysis of effectiveness, here regarded from the perspective of correctness, and of efficiency, here viewed in terms of speed, we expect our exploratory study on the process of building the models, with an analysis of the time-annotated sequences of insertions, deletions and changes while constructing models, to provide us with insights on the main bottlenecks language users experience during the model building process and, conversely, where they seem to experience fewer difficulties. The eye tracking data is expected to provide further context for better identifying language improvement opportunities. In the longer run, the lessons learned in this and similar studies have the potential to help us design more usable software modelling languages. This will also help us better understand how people from different backgrounds interact with each modelling language, building on earlier works that explored how different personal characteristics (e.g. gender) impacted on the learning, problem solving and information processing style [39]. Finally, the SUS usability questionnaire will help us better understand how the differences between both languages impact usability.

B. Threats to validity

1) Conclusion validity: Although we plan to have a reasonable number of participants (over 30), considering the nature of this study, sample size is a likely threat, due to the difficulty in recruiting participants. Our mitigation strategy is to have two teams performing the study in two different countries. The exercise of preparing the experimental replication package so that it can be run both in Portugal and Turkey will help us fine-tune it, making the package more reusable for third-party replications. This will mitigate the sample size risk directly, as we will have participants in both countries, and indirectly, by facilitating potential third-party replications.

2) Internal validity: There is a potential learning effect from solving one challenge to the next. We mitigate this risk by having the crossover design, so that half of the participants start with a SEA_ML++ model while the other half starts with DSML4MAS. Another threat could be that a particular problem would by accident favour one of the languages. To mitigate it, both problems will be modelled in both languages, by different participants. We chose two languages for which the tool support is at a similar level, and with a close look and feel, so that tooling does not play a role in differentiating among the two DSLs. We also made efforts so that all materials were easily readable on a 22-inch monitor and that the models to be developed would fit nicely in a canvas on this kind of monitor, without requiring the user to scroll or zoom the image. Monitor size and the general layout for the experiment, including the distance of the participant to the monitor, were constrained by the technical specifications of our eye tracking device. In spite of these constraints, the tasks are already challenging for our participants.

3) External validity: Our participants will not have, in general, much experience with MAS and with the two languages. As such, our participants are better representatives of developers who are learning these languages. Further research is necessary to assess how these languages compare when used by modellers who are experienced with the two languages. The conclusions of this study will be applicable to these two MAS DSMLs. Replications with other languages, not necessarily for MAS, are required before we can generalise this study’s conclusions to other contexts.

4) Construct validity: After watching a short tutorial about both languages, participants will solve a couple of challenges, one with each language. This may cause an evaluation apprehension threat. We mitigate this by informing participants that the languages are being evaluated, not the participants. The experimental process is built so that we express no bias toward any of the languages, to mitigate the risk of accidentally favouring SEA_ML++. Our goal is to identify opportunities to improve SEA_ML++, rather than the comparison with DSML4MAS itself. Our measures to mitigate this risk include choosing as the author of the recorded tutorials someone with no vested interest in any of the languages, and doing the same for the researchers performing the data analysis. Further, in the interest of transparency and replicability, the data used in these evaluations and the data analysis scripts for SPSS will be made publicly available. Last but not least, this paper discussing the experimental design to be used in this evaluation serves as a manifest of interest in performing this particular experiment. This creates an opportunity for a sanity check, where the initial goals of this study will be directly comparable with what is actually tested in the experiment and reported later, mitigating the potential for selective publishing, where only favourable results would be published.

V. SUMMARY

We presented the experimental planning for the evaluation of the case study of DSLs for Multi-Agent Systems. Our goal is to go beyond the evaluation of the language’s notations (concrete syntax) and evaluate the composition of constructs at the instance (sentence) level (abstract syntax).

It is expected that the results of the evaluation planned in this paper will help in identifying effective improvement opportunities for the developer experience with SEA_ML++. The work triggers future research in that it departs from the more commonly explored part of visual modelling languages (their visual notation) to other relevant perspectives, namely at the instance (sentence) level.

ACKNOWLEDGMENT

The authors would like to thank the following: i) the Scientific and Technological Research Council of Turkey (TUBITAK) under grant 115E591, and ii) Portuguese grants NOVA LINCS Research Laboratory (Grant: FCT/MCTES PEst UID/CEC/04516/2013) and DSML4MAS Project (Grant: FCT/MCTES TUBITAK/0008/2014).

The authors would also like to thank the COST Action networking mechanisms and support of IC1404 Multi-Paradigm Modeling for Cyber-Physical Systems (MPM4CPS). COST is supported by the EU Framework Programme Horizon 2020.

REFERENCES

[1] A. Barisic, V. Amaral, and M. Goulão, “Usability driven DSL development with USE-ME,” Computer Languages, Systems & Structures, vol. 51, pp. 118–157, 2018.
[2] M. Mernik, J. Heering, and A. M. Sloane, “When and how to develop domain-specific languages,” ACM Comput. Surv., vol. 37, no. 4, pp. 316–344, 2005.
[3] D. Moody, “The “physics” of notations: toward a scientific basis for constructing visual notations in software engineering,” IEEE T Software Eng, vol. 35, no. 6, pp. 756–779, 2009.
[4] M. Challenger, S. Demirkol, S. Getir, M. Mernik, G. Kardas, and T. Kosar, “On the use of a domain-specific modeling language in the development of multiagent systems,” Eng Appl Artif Intel, vol. 28, pp. 111–141, 2014.
[5] T. Miranda, M. Challenger, B. T. Tezel, O. F. Alaca, V. Amaral, M. Goulão, and G. Kardas, “Improving the usability of a mas dsml,” in 6th International Workshop on Engineering Multi-Agent Systems (EMAS 2018). Stockholm, Sweden: Springer, July, 14 2018.
[6] C. Hahn, “A domain specific modeling language for multiagent systems,” in Proceedings of the 7th international joint conference on Autonomous agents and multiagent systems-Volume 1, 2008, pp. 233–240.
[7] J. Brooke et al., “Sus-a quick and dirty usability scale,” Usability evaluation in industry, vol. 189, no. 194, pp. 4–7, 1996.
[8] G. Kardas and J. J. Gomez-Sanz, “Special issue on model-driven engineering of multi-agent systems in theory and practice,” Comput Lang Syst Str, vol. 50, pp. 140–141, 2017.
[9] G. Beydoun, G. Low, B. Henderson-Sellers, H. Mouratidis, J. J. Gomez-Sanz, J. Pavon, and C. Gonzalez-Perez, “Faml: a generic metamodel for mas development,” IEEE T Software Eng, vol. 35, no. 6, pp. 841–863, 2009.
[10] G. Ciobanu and C. Juravle, “Flexible software architecture and language for mobile agents,” Concurr Comp-Pract E, vol. 24, no. 6, pp. 559–571, 2012.
[11] J. M. Gascueña, E. Navarro, and A. Fernández-Caballero, “Model-driven engineering techniques for the development of multi-agent systems,” Eng Appl Artif Intel, vol. 25, no. 1, pp. 159–173, 2012.
[12] E. J. T. Gonçalves, M. I. Cortés, G. A. L. Campos, Y. S. Lopes, E. S. Freire, V. T. da Silva, K. S. F. de Oliveira, and M. A. de Oliveira, “Mas-ml 2.0: Supporting the modelling of multi-agent systems with different agent architectures,” J Syst Software, vol. 108, pp. 77–109, 2015.
[13] S. Demirkol, M. Challenger, S. Getir, T. Kosar, G. Kardas, and M. Mernik, “A dsl for the development of software agents working within a semantic web environment,” Computer Science and Information Systems, vol. 10, no. 4, pp. 1525–1556, 2013.
[14] G. Kardas, B. T. Tezel, and M. Challenger, “Domain-specific modelling language for belief-desire-intention software agents,” IET Softw, vol. 12, no. 4, pp. 356–364, 2018.
[15] J. Faccin and I. Nunes, “A tool-supported development method for improved bdi plan selection,” Engineering Applications of Artificial Intelligence, vol. 62, pp. 195–213, 2017.
[16] D. Moody and J. van Hillegersberg, “Evaluating the visual syntax of uml: An analysis of the cognitive effectiveness of the uml family of diagrams,” in International Conference on Software Language Engineering. Springer, 2008, pp. 16–34.
[17] D. L. Moody, P. Heymans, and R. Matulevičius, “Visual syntax does
[21] A. El Kouhen, A. Gherbi, C. Dumoulin, and F. Khendek, “On the semantic transparency of visual notations: Experiments with uml,” in International SDL Forum. Springer, 2015, pp. 122–137.
[22] N. Genon, P. Heymans, and D. Amyot, “Analysing the cognitive effectiveness of the bpmn 2.0 visual notation,” in Proceedings of the Third International Conference on Software Language Engineering, 2010, pp. 377–396.
[23] D. L. Moody, “Why a diagram is only sometimes worth a thousand words: An analysis of the bpmn 2.0 visual notation,” Retrieved 2012-06-19 from http://www.business.uq.edu.au/sites/default/files/event/supportingDocs/Analysis%20of%20BPMN%202.0%20Visual%20Syntax.pdf, Tech. Rep., 2011.
[24] R. Matulevičius and P. Heymans, “Visually effective goal models using kaos,” in International Conference on Conceptual Modeling. Springer, 2007, pp. 265–275.
[25] R. Matulevičius and P. Heymans, “Comparing goal modelling languages: An experiment,” in International Working Conference on Requirements Engineering: Foundation for Software Quality, 2007, pp. 18–32.
[26] M. Santos, C. Gralha, M. Goulão, and J. Araújo, “Increasing the semantic transparency of the kaos goal model concrete syntax,” in 37th International Conference on Conceptual Modeling (ER 2018). Xi’an, China: Springer, October, 22–25 2018.
[27] P. Caire, N. Genon, P. Heymans, and D. L. Moody, “Visual notation design 2.0: Towards user comprehensible requirements engineering notations,” in RE’13. IEEE, 2013, pp. 115–124.
[28] N. Genon, P. Caire, H. Toussaint, P. Heymans, and D. Moody, “Towards a more semantically transparent i* visual syntax,” in International Working Conference on Requirements Engineering: Foundation for Software Quality, 2012, pp. 140–146.
[29] M. Santos, C. Gralha, M. Goulão, J. Araújo, and A. Moreira, “On the impact of semantic transparency on understanding and reviewing social goal models,” in 26th IEEE International Conference on Requirements Engineering (RE 2018). Banff, Canada: IEEE, August, 20–24 2018.
[30] H. Henriques, H. Lourenço, V. Amaral, and M. Goulão, “Improving the developer experience with a low-code process modelling language,” in ACM/IEEE 21st International Conference on Model Driven Engineering Languages and Systems (MODELS). Copenhagen, Denmark: ACM, October 2018.
[31] H. Störrle, “On the impact of layout quality to understanding uml diagrams,” in Visual Languages and Human-Centric Computing (VL/HCC), 2011 IEEE Symposium on. IEEE, 2011, pp. 135–142.
[32] H. Störrle, “On the impact of layout quality to understanding uml diagrams: Diagram type and expertise,” in Visual Languages and Human-Centric Computing (VL/HCC), 2012 IEEE Symposium on. IEEE, 2012, pp. 49–56.
[33] H. Störrle, “On the impact of layout quality to understanding uml diagrams: size matters,” in International Conference on Model Driven Engineering Languages and Systems. Springer, 2014, pp. 518–534.
[34] M. Santos, C. Gralha, M. Goulão, J. Araújo, A. Moreira, and J. Cambeiro, “What is the impact of bad layout in the understandability of social goal models?” in 24th IEEE International Requirements Engineering Conference (RE’16). Beijing, China: IEEE, September, 12–16 2016.
[35] A. Barišić, J. Cambeiro, V. Amaral, M. Goulão, and T. Mota, “Leveraging teenagers feedback in the development of a domain-specific language: the case of programming low-cost robots,” in Proceedings of the 33rd Annual ACM Symposium on Applied Computing. ACM, 2018, pp. 1221–1229.
matter: improving the cognitive effectiveness of the i* visual notation,” Requir Eng, vol. 15, no. 2, pp. 141–175, 2010. [36] V. Basili, G. Caldiera, and H. Rombach, “Goal Question Metric [18] M. Challenger, G. Kardas, and B. Tekinerdogan, “A systematic ap- Paradigm,” Encyclopedia of Software Eng., vol. 1, pp. 528–532, 2001. proach to evaluating domain-specific modeling language environments [37] P. Runeson, M. Host, A. Rainer, and B. Regnell, Case study research for multi-agent systems,” Software Qual J, vol. 24, no. 3, pp. 755–795, in software engineering: Guidelines and examples. Wiley, 2012. Sep. 2016. [38] B. L. Welch, “The generalization of ‘student’s’ problem when [19] G. Kardas, E. Bircan, and M. Challenger, “Supporting the platform several different population variances are involved,” Biometrika, extensibility for the model-driven development of agent systems by the vol. 34, no. 1-2, pp. 28–35, 1947. [Online]. Available: http: interoperability between domain-specific modeling languages of multi- //dx.doi.org/10.1093/biomet/34.1-2.28 agent systems,” Comput Sci Inf Syst, vol. 14, no. 3, pp. 875–912, 2017. [20] F. Bergenti, E. Iotti, S. Monica, and A. Poggi, “Agent-oriented model- [39] L. Beckwith and M. Burnett, “Gender: An important factor in end-user driven development for jade with the jadel programming language,” programming environments?” in Visual Languages and Human Centric Comput Lang Syst Str, vol. 50, pp. 142–158, 2017. Computing, 2004 IEEE Symposium on. IEEE, 2004, pp. 107–114.