=Paper=
{{Paper
|id=None
|storemode=property
|title=Does the Level of Detail of UML Models Affect the Maintainability of Source Code?
|pdfUrl=https://ceur-ws.org/Vol-785/paper3.pdf
|volume=Vol-785
|dblpUrl=https://dblp.org/rec/conf/models/ChaudronGAMP11
}}
==Does the Level of Detail of UML Models Affect the Maintainability of Source Code?==
<pdf width="1500px">https://ceur-ws.org/Vol-785/paper3.pdf</pdf>
<pre>
MODELS'11 Workshop - EESSMod 2011


                    Does the Level of Detail of UML Models Affect the
                             Maintainability of Source Code?

                     Ana M. Fernández-Sáez1, Marcela Genero2, and Michel R.V. Chaudron3
                1
                 Alarcos Quality Center, S.L., Department of Technologies and Information Systems,
                                          University of Castilla-La Mancha
                               Paseo de la Universidad 4, 13071, Ciudad Real, Spain
                                               +34 926295300 ext.6648
                             ana.fernandez@alarcosqualitycenter.com
                2
                    ALARCOS Research Group, Department of Technologies and Information Systems,
                                        University of Castilla-La Mancha
                              Paseo de la Universidad 4, 13071, Ciudad Real, Spain
                                           +34 926295300 Ext. 3740
                                         Marcela.Genero@uclm.es
                                             3
                                               LIACS - Leiden University
                                  Niels Bohrweg 1, 2333 CA Leiden, The Netherlands
                                             +31 715277065 (secr 7061)
                                               chaudron@liacs.nl


                     Abstract. This paper presents an experiment carried out as a pilot study to
                     obtain a first insight into the influence of the quality of UML models on the
                     maintenance of the corresponding source code. The quality of the UML models
                     is assessed by studying the amount of information they contain as measured
                     through a level of detail metric. The experiment was carried out with 11
                     Computer Science students from the University of Leiden. The results obtained
                     indicate a slight tendency towards obtaining better results when using low level
                     of detail UML models, which contradicts our expectations based on previous
                     research found in literature. Nevertheless, we are conscious that the results
                     should be considered as preliminary results given the low number of subjects
                     that participated in the experiment. Further replications of this experiment are
                     planned with students and professionals in order to obtain more conclusive
                     results.

                     Keywords: UML, maintenance, empirical studies, controlled experiment


           1        Introduction

           The current increasing complexity of software projects [1] has led to the emergence of
           UML [2] as a tool with which to increase the understanding between customer and
           developer and to improve communication among team members [3]. Despite this, not
           all UML diagrams have the same complexity, layout, level of abstraction, etc.
           Previous studies have shown that the style and rigor used in the diagrams may vary


                                                          -3-
MODELS'11 Workshop - EESSMod 2011


           2       Ana M. Fernández-Sáez1, Marcela Genero2, and Michel R.V. Chaudron3


           considerably throughout software projects [4], in addition to affecting the source code
           of the system in a different way.
              On the one hand, the different purposes for which a model may be intended (for
           example: architecting solutions, communicating design decisions, detailed
           specification for implementation, or automatically generating implementation code)
           signifies that the same system can be represented with different styles. On the other
           hand, the development diagrams are sometimes available for maintainers, but this is
           not always the case, and the diagrams must be generated with a reverse engineering
           process. The difference in the origin of the models and the different techniques that
           can be used to generate a reverse engineering model result in different styles of
           models. Some of the most notable differences between these models may be the level
           of detail shown. In this work we therefore analyze whether the different levels of
           detail (LoD) affect the work that must be carried out by a maintainer.
              This document is organized as follows. Section 2 presents the related work. Section
           3 presents the description of the experiment. The results obtained in the experiment
           are presented in Section 4, whilst the threats to validity are summarized in Section 5.
           Finally, Section 6 outlines our main conclusions and future work.


           2     Related work

           We performed an SLR [5] to discover all the empirical studies performed as regards
           the use of UML in maintenance, and found only the following two works related to
           the maintenance of source code:
               ─ In [6] an experiment was performed to investigate whether the use of UML
                 influences maintenance in comparison to the use of only source code. The
                 results of this work show a positive influence of the presence of UML for
                 maintainers.
               ─ In the work presented in [7], the experiment performed is focused on the
                 comprehension and the difficulties involved in maintaining object-oriented
                 systems. UML models were also presented to the subjects of the experiment, but
                 they were only focused on exploring the participant’s strategies and problems
                 while they were conducting maintenance tasks on an object-oriented
                 application.
              We therefore decided to perform an experiment related to the influence of different
           levels of detail on UML diagrams when assisting in maintenance tasks. We found a
           paper [8] focused on the understandability of models with different LoD in the
           development phase. The results show a better understanding of models when they
           have a high LoD. We would like to discover whether high LoD diagrams help
           workers to perform the changes that need to be made to the source code during the
           maintenance phase.


           3     Experiment description

             The experiment was carried out at the University of Leiden (The Netherlands) in
           March 2011. In order to run and report this experiment, we followed the


                                                    -4-
MODELS'11 Workshop - EESSMod 2011


            Does the Level of Detail of UML Models Affect the Maintainability of Source Code?     3


           recommendations provided in several works [9-11]. The experiment was presented by
           following the guidelines for reporting empirical research in software engineering [11]
           as closely as possible. The experimental material is available for downloading at:
           http://alarcos.esi.uclm.es/experimentUMLmaintenance/
              In the following subsections we shall describe the main characteristics of the
           experiment, including goal, context, variables, subjects, design, hypotheses, material,
           tasks, experiment procedure and analysis procedure.

           3.1     Goal
           The principal goal of this experiment was to investigate whether the LoD in UML
           models influences the maintenance of source code. The GQM template for goal
           definition [12, 13] was used to define the goal of our experiment as follows: “Analyze
           the level of detail in UML models with the purpose of evaluating it with respect to the
           maintainability of source code from the point of view of researchers, in the context of
           Computer Science students at the University of Leiden.
              As in [3], we considered that the LoD in UML models should be defined as the
           amount of information that is used to represent a modeling element. LoD is a
           'continuous' metric, but for the experiment we have taken two “extremes” - high and
           low LoD.
              We decided to use 3 different types of diagrams (use case, sequence and class
           diagrams) since they are those most frequently used. When the LoD used in a UML
           model is low, it typically employs only a few syntactical features, such as class-name
           and associations, without specifying any further facts about the class. When it is high,
           the model also includes class attributes and operations, association names, association
           directionality, and multiplicity. In sequence diagrams, in which there is a low LoD,
           the messages among objects have an informal label, and when the LoD is high the
           label is a method name plus the parameter list. We consider that it is not possible to
           distinguish between low and high LoD in use case diagrams because they are very
           simple diagrams. The elements that fit each level of detail are shown in Table 1.

                                    Table 1. Levels of detail in UML models

              Diagram                      Element                       Low LoD       High LoD
                           Classes (box and name)                                        
                           Attributes                                                     
                           Types in attributes                                            
                           Operations                                                     
                  Class    Parameters in operations                                       
                 diagram   Associations                                                  
                           Association directionalities                                   
                           Association multiplicities                                     
                           Aggregations                                                  
                           Compositions                                                  
              Sequence     Actors                                                        
              diagram      Objects                                                       


                                                      -5-
MODELS'11 Workshop - EESSMod 2011


           4       Ana M. Fernández-Sáez1, Marcela Genero2, and Michel R.V. Chaudron3


                           Messages in informal language                   
                           Messages with formal language (name                           
                           of a method)
                           Parameters in messages                                        
                           Labels in return messages                                     


           3.2    Context Selection
           The experimental objects consisted of use case, class and sequence diagrams and the
           JAVA code of two software systems, which are summarized below:
               ─ A-H: high LoD diagrams and JAVA code of system A.
               ─ A-L: low LoD diagrams and JAVA code of system A.
               ─ B-H: high LoD diagrams and JAVA code of system B.
               ─ B-L: low LoD diagrams and JAVA code of system B.
           Diagrams A-x described a library domain from which a user can borrow books.
           Diagrams B-x described a sport centre domain from which users can rent services
           (tennis courts, etc.). System A is a Library extracted from [14]. We decided to use it
           because it was a representative system, it was complete (source code and models were
           available) and it gave us a starting point from which to compare our results (it was
           only possible to compare the results obtained from the subjects who received System
           A with high LoD with [7]). System B is a Sport centre application created as part of
           the Master’s degree Thesis of a student from the University of Castilla-La Mancha,
           and we therefore consider it to be a real system. Both systems are desktop
           applications and have more or less the same complexity. These experimental objects
           were presented in English.
              The subjects students on a Software Engineering course from which they had
           acquired training in UML diagrams. Their knowledge was sufficient for them to
           understand the given systems, and they had roughly the same background. They had
           knowledge about the use of UML diagrams in general, but they were taught about
           UML diagrams and JAVA in a training session organized to take place the day before
           the experiment was carried out.
              The experiment was carried out by 11 Computer Science students from the
           University of Leiden (The Netherlands) who were taking the Software Engineering
           course in the second-year of their B.Sc.
              Working with students also implies various advantages, such as the fact that their
           prior knowledge is fairly homogeneous, there is the possible availability of a large
           number of subjects [15], and there is the chance to test experimental design and initial
           hypotheses [16]. An additional advantage of using novices as subjects in experiments
           on comprehensibility and modifiability is that the cognitive complexity of the objects
           under study is not hidden by the subjects’ experience. Nonetheless, we also wish to
           test the findings with practitioners in order to strengthen the external validity of the
           results obtained.
              The students who participated in the experiment were volunteers selected for
           convenience (the students available in the corresponding course). Social threats


                                                     -6-
MODELS'11 Workshop - EESSMod 2011


            Does the Level of Detail of UML Models Affect the Maintainability of Source Code?    5


           caused by evaluation apprehension were avoided by not grading the students on their
           performance.

           3.3    Variables selection
           The independent variable (also called “main factor”) is the LoD, which is a nominal
           variable with two values (low LoD and high LoD). We combined each level of the
           independent variable with the two different systems used to obtain four treatments
           (see Table 2).
              The dependent variables are modifiability and understandability. These two
           variables were considered because understandability and modifiability directly
           influence maintainability [17]. In order to measure these dependant variables, we
           defined the following measures:
              ─ Understandability Effectiveness (UEffec): This measure reflects the ability to
                correctly understand the system presented. It is calculated with the following
                formula: number of correct answers / number of questions. A higher value of
                this measure reflects a better understandability.
              ─ Modifiability Effectiveness (UEffic): This measure reflects the ability to
                correctly modify the system presented. It is calculated with the following
                formula: number of correctly performed modification tasks / number of
                modification tasks. A higher value of this measure reflects a better modifiability.
              ─ Understandability Efficiency (MEffec): This measure also reflects the ability to
                correctly understand the system presented. It is calculated with the following
                formula: time spent / number of correctly answered questions. A lower value of
                this measure reflects a better understandability.
              ─ Modifiability Efficiency (MEffic): This measure also reflects the ability to
                correctly modify the system presented. It is calculated with the following
                formula: time spent / number of correctly performed tasks. A lower value of this
                measure reflects a better modifiability.
              Additional independent variables (called “co-factors”) were considered according
           to the experimental design of the replication, and their effect has been controlled and
           analyzed:
              ─ Order. The selected design (see Table 2), i.e., the variation in the order of
                application of each method (low LoD, high LoD), was intended to alleviate
                learning effects. Nonetheless, we analyzed whether the order in which the LoD
                were used by the subjects biased the results.
              ─ System. This factor indicates the systems (i.e., A and B) used as experimental
                objects. The design selected for the experiment (see Table 2) forced us to
                choose two application domains in order to avoid learning effects. Our intention
                was that the system factor would not be a confounding factor that might also
                influence the subjects’ performances. We therefore selected well-known
                domains and experimental objects of a similar complexity.


                                                      -7-
MODELS'11 Workshop - EESSMod 2011


           6      Ana M. Fernández-Sáez1, Marcela Genero2, and Michel R.V. Chaudron3


           3.4    Hypotheses formulation
           Based on the assumption that the more information a model contains, the more is
           known about the concepts/knowledge described in the model, the hypothesis are:
           1. H1,0: There is no significant difference in the subjects’ understandability
              effectiveness when working with UML diagrams modeled using high or low levels
              of detail.
              H1,1:H1,0
           2. H2,0: There is no significant difference in the subjects’ understandability efficiency
              when working with UML diagrams modeled using high or low levels of detail.
              H2,1:H2,0
           3. H3,0: There is no significant difference in the subjects’ modifiability effectiveness
              when working with UML diagrams modeled using high or low levels of detail.
              H3,1:H3,0
           4. H4,0: There is no significant difference in the subjects’ modifiability efficiency
              when working with UML diagrams modeled using high or low levels of detail.
              H4,1: H4,0
             The goal of the statistical analysis will be to reject these null hypotheses and
           possibly to accept the alternative ones (e.g., Hn1=¬ Hn0).

           3.5    Experimental design
           We selected a balanced factorial design in which the group-interaction acted as a
           confounding factor [18] which permits the lessening of the effects of learning and
           fatigue. The experiment’s execution consisted of two runs. In each round, each of the
           groups was given a different treatment. The corresponding system (source code +
           UML models) was assigned to each group at random, but was given out in a different
           order in each case. Table 2 presents the outline of the experimental design.

                                        Table 2. Experimental design
                     RUN 1              LoD                RUN 2            LoD
                                   Low      High                        Low     High
                             A    Group 1 Group 2                  A   Group 3 Group 4
                   System                               System
                             B    Group 3 Group 4                  B   Group 2 Group 1

              Before carrying out the experiment, we provided the subjects with a background
           questionnaire and assigned them to the 4 groups randomly, based on the marks
           obtained in the aforementioned questionnaire (blocked design by experience) in an
           attempt to alleviate experience effects. To avoid a possible learning effect, the
           diagrams came from different application domains (A-a Library and B-a Sport
           centre).
              When designing the experiment we attempted to alleviate several issues that might
           threaten the validity of the research done by considering the suggestions provided in
           [19].


                                                     -8-
MODELS'11 Workshop - EESSMod 2011


            Does the Level of Detail of UML Models Affect the Maintainability of Source Code?   7


           3.6    Experimental tasks
           The tasks to be performed did not require high levels of industrial experience, so we
           believed that the use of students could be considered appropriate, as suggested in
           literature [20, 21]. The material used was written in English.
               There were three kinds of tasks:
              ─ Understandability task: This contained 3 questions concerning the semantics
                of the system, i.e. the semantics of diagrams and the semantics of code. These
                questions were multiple choice questions and were used to obtain UEffec and
                UEffic.
              ─ Modifiability task: The subjects received a list of requirements in order to
                modify the code of the system in order to add/change certain functionalities.
                This part of the experiment contained 3 modifiability tasks and allowed us to
                calculate MEffec and MEffic. The subjects were provided with answer sheets to
                allow them to structure their responses related to maintenance tasks. They had to
                fill in a different form depending on the element that they wished to maintain.
                The answer sheets can be found at:
                 http://alarcos.esi.uclm.es/experimentUMLmaintenance/
              ─ Post-questionnaire task: At the end of the execution of each run, the subjects
                were asked to fill in a post-experiment questionnaire, whose goal was to obtain
                feedback about the subjects’ perception of the experiment execution, which
                could be used to explain the results obtained. The answers to the questions were
                based on a five-point Likert scale [22].

           3.7    Experimental procedure
           The experiment took place in two sessions of two hours each. The subjects first
           attended a training session in which detailed instructions on the experiment were
           presented and the main concepts of UML and JAVA were revised. In this session, the
           subjects carried out an exercise similar to those in the experimental tasks in
           collaboration with the instructor. During the training session, the subjects were
           required to fill in a background questionnaire. Based on the marks obtained in this
           questionnaire, the subjects were randomly assigned to the 4 groups shown in Table 2,
           thus obtaining balanced groups in accordance with the marks obtained in the
           background questionnaire.
              The experiment then took place in a second session, consisting of two runs. In each
           run, each of the groups was given a different treatment, as is shown in Table 2.
              The experiment was conducted in a classroom, where the students were supervised
           by the instructor and no communication among them was allowed.
              After the experiment execution, the data collected from the experiment were placed
           on an excel sheet.

           3.8    Analysis procedure
           The data analysis was carried out by considering the following steps:


                                                      -9-
MODELS'11 Workshop - EESSMod 2011


           8          Ana M. Fernández-Sáez1, Marcela Genero2, and Michel R.V. Chaudron3


           1. We first carried out a descriptive study of the measures of the dependent variables,
              i.e., understandability and modifiability.
           2. We then tested the formulated hypotheses using the non-parametric Kruskal-Wallis
              test [23] for the data collected in the experiment. The use of this test was possible
              because, according the design of the controlled experiment, we obtained paired
              samples. In addition, Kruskal-Wallis is the most appropriate test with which to
              explore the results of a factorial design with confounded interaction [18, 24], i.e.,
              the design used in our experiment, when there is non-normal distribution of the
              data.
           3. We next used the Kruskal-Wallis test to analyze the influence of the co-factors
              (i.e., System and Order).
           4. The data collected from the post-experiment questionnaire was finally analyzed
              using bar graphs.


           4     Results
           The following subsections show the results of the data analysis of the experiment
           performed using SPSS [25].

           4.1    Descriptive statistics and exploratory analysis
           Table 3 and Table 4 show the descriptive statistics of the Understandability and
           Modifiability measures, respectively (i.e., mean ( ), standard error (SE), and
           standard deviation (SD)), grouped by LoD.

                                  Table 3. Descriptive statistics for UEffec and UEffic.
                                                        Ueffec                             UEffic
               LoD          Subjects
                                                        SE           SD                      SE       SD
               Low      N = 10 (1 outlier)   0.767     0.051        0.161   334.500        36.308   114.816
               High          N = 11          0.758     0.650        0.215   363.924        82.602   273.960

                                  Table 4. Descriptive statistics for Meffec and MEffic.
                                                  Meffec                            MEffic
                     LoD     Subjects
                                                   SE         SD                     SE          SD
                  Low        N = 11     0.437     0.066      0.221     240.121     41.008      136.007
                  High       N = 11     0.402     0.050      0.169     294.637     47.198      156.539

             At a glance, we can observe that when the subjects used low LoD diagrams they
           obtained better values in all variables. This indicates that low LoD diagrams may, to
           some extent, improve the comprehension and modification of the source code.


                                                           - 10 -
MODELS'11 Workshop - EESSMod 2011


            Does the Level of Detail of UML Models Affect the Maintainability of Source Code?            9


           4.2     Influence of LoD
           In order to test the formulated hypotheses we analyzed the effect of the main factor
           (i.e. LoD) on the dependent variables considered (i.e., UEffec, UEffic, MEffec and
           MEffic) using the Kruskal-Wallis test (see Table 5).

                          Table 5. Kruskal-Wallis test results for Ueffec, Ueffic, Meffec and Meffic
                                                 Ueffec     UEffic    Meffec   MEffic
                                        LoD       1         0.439     0.792    0.491

           Testing H1,0 (UEffec)
           The results in Table 5 suggest that the null hypothesis cannot be rejected since the p-
           value is greater than 0.05. This means that there is no significant difference in UEffec in
           either group.
               We decided to investigate this result in greater depth by calculating the number of
           subjects who achieved better values when using the low LoD models (i.e. a low LoD
           value is higher than a high LoD value):

                               Table 6. Comparison of subjects’ results for each measure
                          low LoD = high LoD              low LoD < high LoD        low LoD > high LoD
                 UEffec           6                               3                         2
                 UEffic           0                               7                         4
                 MEffec           0                               7                         4
                 MEffic           0                               5                         6

              As Table 6 shows, the number of subjects who obtained the same results for both
           treatments (high and low LoD) is relatively high. There were more subjects who
           performed better with a high LoD than with a low LoD, but the differences in
           comparison to the opposite group is very small (only one subject).

           Testing H2,0 (UEffic)
           The results in Table 5 suggest that the null hypothesis cannot be rejected since the p-
           value is greater than 0.05. This means that there is no significant difference in UEffic in
           either group.
               We decided to investigate this result in greater depth by calculating the number of
           subjects who achieved better values when using the low LoD models (i.e. a low LoD
           value is smaller than a high LoD value):
              As Table 6 shows, no subjects obtained the same UEffic for both treatments (high
           and low LoD). More subjects performed better with a low LoD than with a high LoD.

           Testing H3,0 (MEffec)
           The results in Table 5 suggest that the null hypothesis cannot be rejected since the p-
           value is greater than 0.05. This means that there is no significant difference in MEffec
           in either group.


                                                             - 11 -
MODELS'11 Workshop - EESSMod 2011


           10       Ana M. Fernández-Sáez1, Marcela Genero2, and Michel R.V. Chaudron3


              We decided to investigate this result in greater depth by calculating the number of
           subjects who achieved better values when using the low LoD models (i.e. a low LoD
           value is higher than a high LoD value):
              As Table 6 shows, no subjects obtained the same MEffec for both treatments (high
           and low LoD). More subjects performed better with a high LoD than with a low LoD.

           Testing H4,0 (MEffic)
           The results in Table 5 suggest that the null hypothesis cannot be rejected since the p-
           value is greater than 0.05. This means that there is no significant difference in MEffic
           in either group.
               We decided to investigate this result in greater depth by calculating the number of
           subjects who achieved better values when using the low LoD models (i.e. a low LoD
           value is smaller than a high LoD value):
              As Table 6 shows, no subjects obtained the same MEffic for both treatments (high
           and low LoD). More subjects performed better with a high LoD than with a low LoD,
           but the differences in comparison to the opposite group are also small.

           4.3    Influence of system
           In order to test the effect of the co-factor System, we performed a Kruskal-Wallis test
           whose results are shown in Table 7. As all the p-values were higher than 0.05, except
           in one case (UEffic), we did not have sufficient evidence to reject the hypothesis, i.e. it
           seems that the system did not influence the subjects’ performance (and this was
           therefore a controlled co-factor).

                        Table 7. Kruskal-Wallis test results for the influence of the System.
                                              Ueffec    UEffic    Meffec    MEffic
                                   System     0.804     0.035     0.575     0.061


           4.4    Influence of order
           In order to test the effect of Order, we performed a Kruskal-Wallis test (see Table 8).
           As all p-values were higher than 0.05, we did not have sufficient evidence to reject
           the hypothesis, i.e. the order did not influence the subjects’ performance (and this was
           therefore a controlled co-factor).

                                       Table 8. Kruskal-Wallis tests results.
                                              Ueffec   UEffic    Meffec    MEffic
                                    Order      1       0.105     0.223     0.341


           4.5    Post- experiment survey questionnaire results
           The analysis of the answers to the post-experiment survey questionnaire revealed that
           the time needed to carry out the comprehension and modification tasks was


                                                       - 12 -
MODELS'11 Workshop - EESSMod 2011


           Does the Level of Detail of UML Models Affect the Maintainability of Source Code?        11


           considered to be inappropriate (more time was needed), and that the subjects
           considered the tasks to be quite difficult (Fig. 1).


                               Fig. 1. Subjects' perception of the experiment.

               We also asked about the subjects’ perception of some of the items that appeared in
           the high LoD diagrams but did not appear in the low LoD diagrams. Fig. 2 shows that
           high LoD elements seem to be appreciated by the subjects. With regard to the
           histograms in Fig. 2, if a subject responds 1 or 2, this indicates that s/he thinks that the
           element in the question was helpful, while a response of 4 or 5 indicates that the
           elements in the question are not helpful (3 is a neutral response). If we focus on the
           elements related to class diagrams (upper histograms) we can see that attributes are
           helpful for 9 subjects (versus 1 subject who does not believe them to be helpful). The
           same is true of operations (10 subjects vs. 1 subject). If we focus on the elements
           related to sequence diagrams (lower histograms) we can see that formal messages are
           more helpful (16 subjects) than natural language messages (0 subjects), and the same
           can also be said of the appearance of parameters in messages (13 subjects vs. 2
           subjects).

           4.6    Summary and discussion of the data analysis
           The null hypothesis cannot be rejected for any of the dependent variables. Although
           we cannot draw conclusive results on the main factor (LoD), we have found that co-
           factors (system, order) have not influenced the results.
              Nevertheless, the descriptive statistics in general showed a slight tendency in favor
           of using low LoD diagrams in contrary to what we believed, as the diagrams with a
           high LoD helped developers in the software development stage [8]. This may result
           from the fact that the subjects did not have the expected amount of knowledge about
           UML (a mean of 8.8 correct answers out of 16 questions) and JAVA (a mean of 4.9
           correct answers out of 9 questions) tested in the background questionnaire. The results
           of the experiment must be considered as preliminary results owing to the small size of
           the group of subjects who participated in the experiment.


                                                      - 13 -
MODELS'11 Workshop - EESSMod 2011


           12         Ana M. Fernández-Sáez1, Marcela Genero2, and Michel R.V. Chaudron3


                    Fig. 2. Subjects’ opinion of LoD (1=Complete Agreement 2=Partial Agreement
                     3=Neither agree/ nor disagree 4=Partial Disagreement 5=Total disagreement)
           5      Threats to Validity
           We must consider certain issues which may have threatened the validity of the
           experiment:
                ─ External validity: External validity may be threatened when experiments are
                  performed with students, and the representativeness of the subjects in
                  comparison to software professionals may be doubtful. In spite of this, the tasks
                  to be performed did not require high levels of industrial experience, so we
                  believed that this experiment could be considered appropriate, as suggested in
                  literature [13]. There are no threats related to the material used since the systems
                  used were real ones.
                ─ Internal validity: Internal validity threats are mitigated by the design of the
                  experiment. Each group of subjects worked on the same system in different
                  orders. Nevertheless, there is still the risk that the subjects might have learned
                  how to improve their performances from one performance to the other.
                  Moreover, the instrumentation was tested in a pilot study in order to check its
                  validity. In addition, mortality threats were mitigated by offering the subjects
                  extra points in their final marks.
                ─ Conclusion validity: Conclusion validity concerns the data collection, the
                  reliability of the measurement, and the validity of the statistical tests. Statistical
                  tests were used to reject the null hypotheses. We have explicitly mentioned and
                  discussed when non-significant differences were present. What is more,


                                                        - 14 -
MODELS'11 Workshop - EESSMod 2011


           Does the Level of Detail of UML Models Affect the Maintainability of Source Code?    13


                 conclusion validity might also be affected by the number of observations.
                 Further replications on larger datasets are thus required to confirm or contradict
                 the results.
               ─ Construct validity: This may be influenced by the measures used to obtain a
                 quantitative evaluation of the subjects’ performance, the comprehension
                 questionnaires, the maintenance tasks, and the post-experiment questionnaire.
                 The metrics used were selected to achieve a balance between the correctness
                 and completeness of the answers. The questionnaires were defined to obtain
                 sufficiently complex questions without them being too obvious. The post-
                 experiment questionnaire was designed using standard forms and scales. Social
                 threats (e.g., evaluation apprehension) have been avoided, since the students
                 were not graded on the results obtained.


           6     Conclusions and future work

           The main concern of the research presented in this paper is the use of a controlled
           experiment to investigate whether the use of low or high level of detail in UML
           diagrams influences the maintainer’s performance when understanding and modifying
           source code. The experiment was carried out by 11 academic students from the
           University of Leiden in the Netherlands.
              The results obtained are not significant owing to various factors such as the fact
           that the subjects selected had a low level of experience in using UML and JAVA
           code, and the small size of the group of subjects who participated in the experiment. It
           is only possible to observe a slight tendency towards obtaining better results with low
           LoD diagrams, contrary to the results obtained in [8].
              Despite these drawbacks, we have ensured that the experimental results were not
           influenced by other co-factors such as the system used or the order in which the
           subjects received the experimental material.
              We are planning to perform two replications with students from the University of
           Castilla-La Mancha (Spain) and students from the University of Bari (Italy). A third
           possible replication with professionals is also being planned. All the drawbacks found
           in the execution of this experiment will be taken into account in the replications.

           Acknowledges. This research has been funded by the following projects: MEDUSAS
           (CDTI-MICINN and FEDER IDI- 20090557), ORIGIN (CDTI-MICINN and FEDER
           IDI-2010043(1-5), PEGASO/MAGO (MICINN and FEDER, TIN2009-13718-C02-
           01), EECCOO (MICINN TRA2009-0074), MECCA (JCMM PII2I09-0075-8394) and
           IMPACTUM (PEII 11-0330-4414).

           References
           1. Van Vliet, H., Software Engineering: Principles and Practices 3rd ed. 2008:
              Wiley.
           2. OMG. The Unified Modeling Language. Documents associated with UML Version
              2.3 2010; Available from: http://www.omg.org/spec/UML/2.3.
           3. Nugroho, A. and M.R.V. Chaudron. Evaluating the impact of UML modeling on
              software quality: An industrial case study. in Proceeding of 12th International


                                                     - 15 -
MODELS'11 Workshop - EESSMod 2011


           14      Ana M. Fernández-Sáez1, Marcela Genero2, and Michel R.V. Chaudron3


               Conference on Model Driven Engineering Languages and Systems (MODELS’09).
               2009.
           4. Lange, C.F.J. and M.R.V. Chaudron, In practice: UML software architecture and
               design description. IEEE Software, 2006. 23(2): p. 40-46.
           5. Fernández-Sáez, A.M., M. Genero, and M.R.V. Chaudron, Empirical studies on
               the influence of UML in software maintenance tasks: A systematic literature
               review. Submitted to Science of Computer Programming - Special issue on
               Software Evolution, Adaptability and Maintenance, Elsevier.
           6. Dzidek, W.J., E. Arisholm, and L.C. Briand, A realistic empirical evaluation of
               the costs and benefits of UML in software maintenance. IEEE Transactions on
               Software Engineering, 2008. 34(3): p. 407-432.
           7. Karahasanovic, A. and R. Thomas. Difficulties Experienced by Students in
               Maintaining Object-Oriented Systems: an Empirical Study. in Proceedings of the
               Australasian Computing Education Conference (ACE'2007) 2007.
           8. Nugroho, A., Level of detail in UML models and its impact on model
               comprehension: A controlled experiment. Information and Software Technology,
               2009. 51(12): p. 1670-1685.
           9. Juristo, N. and A. Moreno, Basics of Software Engineering Experimentation.
               2001: Kluwer Academic Publishers.
           10. Wohlin, C., et al., Experimentation in Software Engineering: an Introduction.
               2000: Kluwer Academic Publisher.
           11. Jedlitschka, A., M. Ciolkowoski, and D. Pfahl, Reporting Experiments in Software
               Engineering, in Guide to Advanced Empirical Software Engineering F. Shull, J.
               Singer, and D.I.K. Sjøberg, Editors. 2008, Springer Verlag.
           12. Basili, V. and D. Weiss, A Methodology for Collecting Valid Software
               Engineering Data. IEEE Transactions on Software Engineering, 1984. 10(6): p.
               728-738.
           13. Basili, V., F. Shull, and F. Lanubile, Building Knowledge through Families of
               Experiments. IEEE Transactions on Software Engineering, 1999. 25: p. 456-473.
           14. Eriksson, H.E., et al., UML 2 Toolkit. 2004: Wiley.
           15. Verelst, J. The Influence of Abstraction on the Evolvability of Conceptual Models
               of Information Systems. in International Symposium on Empirical Software
               Engineering (ISESE'04). 2004.
           16. Sjøberg, D.I.K., et al., A Survey of Controlled Experiments in Software
               Engineering. IEEE Transaction on Software Engineering, 2005. 31(9): p. 733-753.
           17. ISO/IEC, ISO/IEC 25000: Software Engineering, in Software product quality
               requirements and evaluation (SQuaRe). 2008, International Organization for
               Standarization.
           18. Kirk, R.E., Experimental Design. Procedures for the Behavioural Sciences. 1995:
               Brooks/Cole Publishing Company.
           19. Wohlin, C., et al., Experimentation in Software Engineering: An Introduction.
               2000, Norwell, MA, USA: Kluwer Academic Publishers.
           20. Basili, V., F. Shull, and F. Lanubile, Building Knowledge through Families of
               Experiments. IEEE Transactions on Software Engineering, 1999. 25(4): p. 456-
               473.
           21. Höst, M., B. Regnell, and C. Wholin. Using students as subjects - a comparative
               study of students and professionals in lead-time impact assessment. in 4th


                                                   - 16 -
MODELS'11 Workshop - EESSMod 2011


           Does the Level of Detail of UML Models Affect the Maintainability of Source Code?   15


               Conference on Empirical Assessment and Evaluation in Software Engineering.
               2000.
           22. Oppenheim, A.N., Questionnaire Design, Interviewing and Attitude Measurement.
               1992: Pinter Publishers.
           23. Conover, W.J., Practical Nonparametric Statistics. 3rd ed. 1998: Wiley.
           24. Winer, B.J., D.R. Brown, and K.M. Michels, Statistical Principles in
               Experimental Design. 3rd ed. 1991: Mc Graw Hill Series in Psychology.
           25. SPSS, SPSS 12.0, Syntax Reference Guide. 2003, Chicago, USA: SPSS Inc.


                                                     - 17 -

</pre>