   Code Process Metrics in University Programming
                     Education
                        Linus W. Dietz∗ , Robin Lichtenthäler† , Adam Tornhill‡ , and Simon Harrer§
                    ∗ Department of Informatics, Technical University of Munich, Germany, linus.dietz@tum.de
              † Distributed Systems Group, University of Bamberg, Germany, robin.lichtenthaeler@uni-bamberg.de
                                              ‡ Empear, Sweden, adam.tornhill@empear.com
                                    § innoQ Deutschland GmbH, Germany, simon.harrer@innoq.com



   Abstract—Code process metrics have been widely analyzed within large-scale projects in the software industry. Since they reveal much about how programmers collaborate on tasks, they could also provide insights into programming and software engineering education at universities. Thus, we investigate two courses taught at the University of Bamberg, Germany, to gain insights into the success factors of student groups. However, a correlation analysis of eight metrics with the students' scores revealed only weak correlations. In a detailed analysis, we examine the trends in the data per assignment and interpret them using our knowledge of code process metrics and the courses. We conclude that the analyzed programming projects were not suitable for code process metrics to manifest themselves because of their scope and the students' focus on implementing functionality rather than following good software engineering practices. Nevertheless, we can give practical advice on the interpretation of code process metrics of student projects and suggest analyzing projects of larger scope.

                            I. INTRODUCTION

   When teaching programming or practical software engineering courses, lecturers often give students advice on how to manage their group work successfully. Such advice could be to start early so students do not miss the deadline, or to split up the tasks so everybody learns something. Intuitively, such practices seem appropriate, but do they actually lead to more successful group work? To answer this, objective metrics are needed as evidence. Code process metrics capture the development progress [1], as opposed to static code analysis, which looks only at the outcome [2]. Since they have been used successfully in the software industry [3], they might be useful in programming education to give advice on how to organize the development process of student projects. As a first step towards applying code process metrics in programming education, we want to assess their explanatory power with respect to students' success. We mine and analyze the Git repositories of two programming courses to answer our research question: “How meaningful are code process metrics for assessing the quality of student programming assignments?” By this, we hope to gain insights and provide recommendations to lecturers teaching such courses.

                              II. METHOD

   The subjects of our analysis are two practical programming courses for undergraduate computer science students at the University of Bamberg: ‘Advanced Java Programming’ (AJP), covering XML serialization, testing, and GUIs, and ‘Introduction to Parallel and Distributed Programming’ (PKS), covering systems that communicate through shared memory and message passing on the Java Virtual Machine. Students typically take AJP in their third semester and PKS in their fifth.

                               TABLE I
                     OVERVIEW OF THE ASSIGNMENTS

        Course    #    Technologies
        AJP       1    IO and Exceptions
                  2    XML mapping with JAXB and a CLI-based UI
                  3    JUnit Tests and JavaDoc documentation
                  4    JavaFX GUI with MVC
        PKS       1    Mutexes, Semaphores, BlockingQueue
                  2    Executor, ForkJoin, and Java Streams
                  3    Client/server with TCP
                  4    Actor model with Akka

   The courses follow a similar didactic concept that has been continuously evolved since 2011 [2]. During the semester, the students submit four two-week assignments (see Table I), solved in groups of three. These assignments require applying the programming concepts and technologies previously introduced in the lectures to realistic problems, such as implementing a reference manager or an issue tracker. For each assignment, the groups get a project template with a few predefined interfaces. We provide a Git repository for each group to work with and to submit their solutions. Since undergraduates in their third term are usually not proficient with version control systems, we also hold a Git tutorial at the beginning of the course, covering how to commit, push, merge, resolve conflicts, and come up with good commit messages. More advanced topics such as working with feature branches or structured commit messages are not in the scope of this introduction.
   We grade each assignment in the form of a detailed textual code review and a score between 0 and 20 points. The main part of that score accounts for functional correctness, which we check with the help of unit tests. However, we also evaluate the code quality, determined by a thorough code review. To avoid bias from a single lecturer, we established a peer review by the other lecturer. By this, the score should be an objective indicator of the quality of the solution. Over the years, we
have built a knowledge base of typical code quality issues, recently culminating in the book Java by Comparison [4], which we use to refer to issues in the textual code review.

A. Data Set
   The data analyzed in this paper are the Git repositories of one iteration of AJP (24 groups) and PKS (14 groups) in the academic year of 2016. This results in a total of 152 submissions. All groups submitted four assignments, and no group scored less than 10 points in any assignment. An assignment solution consists of all the commits related to the assignment. Each commit includes its message, the changes made, a timestamp, and the author. Each submission had at most three authors; however, this number was sometimes reduced to two when a student dropped out of the course. Because of their limited experience with Git, the students worked solely on the master branch. Furthermore, since the focus of the courses was not on software engineering skills, the students could freely choose how to collaborate on the assignments, and we enforced no policy regarding collaboration or the commit messages.

B. Processing and Metrics

   Before mining the raw data for metrics, we performed a data cleaning step. We observed that students used different machines with varying Git configurations for their work. This resulted in multiple email identifiers for a student. Therefore, we inspected the repositories and added .mailmap¹ files to consolidate the different identifiers.
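   As an illustration, a minimal .mailmap sketch that maps several commit identities of one student to a single canonical identity could look as follows; the names, addresses, and machines are hypothetical and not taken from our data set:

      # canonical identity                      identity as recorded in the commits
      Jane Doe <jane.doe@stud.uni-bamberg.de>   Jane Doe <jane@jane-laptop.local>
      Jane Doe <jane.doe@stud.uni-bamberg.de>   jdoe <jdoe@pc42.lab.uni-bamberg.de>

   Git's log and shortlog commands can subsequently report both recorded identities under the canonical one.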
   After this cleaning step, we did the data mining with the proprietary APIs of CodeScene², a tool for predictive analyses and visualizations to prioritize technical debt in large-scale code bases. The tool processes the individual Git repositories together with the information about the separate assignments. We customized the analysis to calculate metrics per assignment solution and selected the following metrics:
   • Number of commits. The total number of commits related to the specific assignment.
   • Mean author commits. The mean number of commits per author.
   • Mean commit message length. The mean number of characters in commit messages, excluding merge commits.
   • Number of merge commits. The number of merges.
   • Number of bug fixes. The number of commits with ‘bugfix’ or ‘fix’ in the commit message.
   • Number of refactorings. The number of commits with ‘refactor’ or ‘improve’ in the commit message.
   • Author fragmentation. A metric describing how fragmented the work on single files is across authors [5].
   • Days with commits. The number of days with at least one commit in the assignment period.
   These metrics cover the most relevant aspects of the process. Unfortunately, we could not consider the size, i.e., the number of additions and deletions of the commits, because the students imported project skeletons for each assignment and there were dependencies between the assignments. For example, in PKS the very same task had to be solved using different technologies, and the students were encouraged to copy their old solution to the current assignment for comparing the performances.
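   To make the commit-based metrics concrete, the following sketch shows how most of them could be approximated directly from a Git history. It is purely illustrative: it is not the CodeScene pipeline we used, and the repository path and assignment period are hypothetical placeholders.

      # Sketch: approximate some of the commit-based code process metrics from a
      # local Git clone. Illustrative only; repository path and assignment period
      # are hypothetical, and CodeScene's analysis is replaced by a plain `git log`.
      import subprocess

      def git_log(repo, since, until):
          """Return (author_email, date, subject, is_merge) for each commit in the period."""
          out = subprocess.run(
              ["git", "-C", repo, "log", f"--since={since}", f"--until={until}",
               "--date=short", "--pretty=format:%ae%x09%ad%x09%P%x09%s"],
              capture_output=True, text=True, check=True).stdout
          commits = []
          for line in filter(None, out.splitlines()):
              author, day, parents, subject = line.split("\t", 3)
              commits.append((author, day, subject, len(parents.split()) > 1))
          return commits

      def metrics(repo, since, until):
          commits = git_log(repo, since, until)
          authors = {author for author, _, _, _ in commits}
          non_merge = [c for c in commits if not c[3]]
          return {
              "number of commits": len(commits),
              "mean author commits": len(commits) / max(len(authors), 1),
              "mean commit message length":
                  sum(len(s) for _, _, s, _ in non_merge) / max(len(non_merge), 1),
              "number of merge commits": sum(1 for c in commits if c[3]),
              # the substring 'fix' also matches 'bugfix'
              "number of bug fixes":
                  sum(1 for _, _, s, _ in commits if "fix" in s.lower()),
              "number of refactorings":
                  sum(1 for _, _, s, _ in commits
                      if "refactor" in s.lower() or "improve" in s.lower()),
              "days with commits": len({day for _, day, _, _ in commits}),
          }

      if __name__ == "__main__":
          print(metrics("path/to/group-repo", "2016-11-01", "2016-11-15"))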
   ¹ https://www.git-scm.com/docs/git-check-mailmap
   ² https://empear.com/

   Fig. 1. Distribution of points per course and assignment

                              III. RESULTS

   Before investigating the code process metrics, we display the distribution of achieved points per course and assignment in Figure 1. In AJP, the median score of 17 to 18 is quite high and homogeneous over the course; however, there is some variability, with a standard deviation of 2.21. In the more advanced PKS course, the scores show a rising tendency, from a median of only 15 in the first assignment to a median of 19 in the fourth. We assume that this is because students have little prior knowledge of concurrent programming. Additionally, the first assignment deals with low-level threading mechanisms, which require a profound understanding. The students gradually improve their performance over the course by gaining experience and because the later assignments deal with more convenient concurrency programming constructs. The standard deviation of the points is 1.88.

A. Correlating Code Process Metrics with Points

   To analyze the aforementioned code process metrics for correlations with the achieved points, we calculated the pairwise Pearson Correlation Coefficient (PCC) between all features over all solutions, irrespective of the course. Surprisingly, we did not encounter any notable relationship between any of our metrics and the points, as can be seen in the ‘Overall’ column of Table II.
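   For reference, the PCC is the standard sample correlation coefficient; for a metric $x$ and the points $y$ over $n$ solutions,

      $r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$,

   where $\bar{x}$ and $\bar{y}$ are the respective means; values near ±1 indicate a strong linear relationship, and values near 0 indicate none.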
                               TABLE II
     PEARSON CORRELATION COEFFICIENT BETWEEN POINTS AND FEATURES

        Feature                         Overall    AJP     PKS
        Mean Author Fragmentation        0.01      0.09   -0.25
        Mean Commit Message Length       0.20      0.35   -0.06
        Mean Author Commits              0.05      0.04   -0.09
        Number of Commits                0.04      0.05   -0.16
        Number of Merge Commits          0.08      0.10   -0.12
        Number of Bug Fixes              0.04      0.09   -0.04
        Days With Commits                0.03      0.07   -0.11
        Number of Refactorings           0.07      0.07    0.08
   Fig. 2. The score relative to the number of commits

   Fig. 3. The score relative to the author fragmentation

   When looking at the two courses separately, however, in AJP one can see only a moderate positive correlation of 0.35 between the commit message length and the points. Interestingly, this effect cannot be seen in PKS. There, we mainly see weak negative correlations, which is also surprising, as it seems that, in contrast to AJP, more work does not lead to more points.
   More effort does not necessarily mean more points. While it is generally hard to quantify effort directly using our metrics, the combination of the number of commits and the days with commits is the best available proxy. Figure 2 shows the number of commits per course and assignment. The lines drawn on top of the data points are linear regression fits that serve as a visual aid for the trends in the data. Interestingly, there is no consistent trend observable over the assignments or the courses. AJP Assignment 1 has a negative trend, indicating that the groups that managed to solve the assignment with fewer commits got higher scores in that assignment. We attribute this to the prior knowledge of the students at the start of the course. In the next two assignments of AJP, the trend is positive, whereas the number of commits in the last assignment did not have an impact on the grading. In PKS, both the number of commits and the days with commits show positive trends with the points in the first three assignments, while the last assignment shows a flat trend. Our interpretation is the following: assignments that require much code to be written by the students benefit from more commits, while it is the other way around for assignments where the framework guides the development. Recall that in Assignment 4 of AJP the task is to write a GUI using JavaFX, and Assignment 4 of PKS is about using akka.io.
   Distributing the work over a longer time span does not increase the points. Generally, we assumed that starting earlier and working on the assignments continuously, and therefore accumulating more days with a commit, would increase the score. However, this is not the case. In AJP, the PCC between this feature and the points was 0.07, whereas in PKS it was even a weak negative value of −0.11. We assume this has to do with the limited temporal scope of only two weeks of working on the assignments. The more experienced groups might have finished the assignment in a shorter time period and stopped working when they thought that their solution was sufficient.
   Working on the same classes is not advisable. Another metric we analyzed was the author fragmentation. It measures whether the Java classes were written by a single author (zero fragmentation) or collaboratively. In AJP, there was again barely any correlation (0.09), whereas in PKS there was a weak negative PCC of −0.25. This is somewhat in line with findings in the literature, where lower fragmentation indicates higher software quality [5]. When we take a closer look at the assignments in Figure 3, however, there is again a mixed signal: PKS Assignments 1 and 4 show a negative dependency, whereas Assignments 2 and 3 are relatively stable.
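   For intuition, one common way to formalize such a fragmentation measure is the fractal value of D'Ambros et al. [5]; in a simplified notation of ours (see [5] for the exact definition), for a file with $N$ commits of which $n_a$ were made by author $a$ from the set of authors $A$,

      $F = 1 - \sum_{a \in A} \left( \frac{n_a}{N} \right)^2$,

   so $F$ is 0 when a single author wrote the file and approaches 1 as the work is spread evenly across many authors.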
   Finally, we refrain from analyzing the number of bug fixes and refactorings because the corresponding keywords rarely appeared in the commit messages.

B. Discussion

   We found no notable correlations between the analyzed code process metrics and the quality of the assignments as measured via manual grading. This stands in contrast to the literature on software quality and code process metrics in industry [6]. So what makes the assignments different from real-world projects?
   First of all, the timeframe differs. The assignments in our courses all lasted two weeks, whereas projects in industry span multiple months and even years. Furthermore, students focus solely on developing software, giving little thought to how to run and maintain it. What is more, the students had a good sense of when their assignment met the functional requirements and stopped working once they were satisfied with their solution. Thus, equating the assignment scores with the notion of high-quality software is most probably not permissible in our courses.
   On the other hand, it might be that code process metrics simply require a certain amount of effort by more developers to be put into the code, which does not happen in the small student groups during the short assignment period. In industry projects, maintenance work, i.e., bug fixing and refactoring, accounts for a large portion of the commits and contributes substantially to the overall quality of the software. Looking at the commit messages of the student assignments, we see that such efforts were rare. Also, communication problems become more of an issue in larger groups.
   Finally, the student groups were quite new to the management of group programming tasks, especially in the third semester
course AJP. Since they could organize the development on their own, there was a myriad of different strategies. We believe that this lack of organizational requirements is a key reason why we do not see clear patterns in the code process metrics.

                           IV. RELATED WORK

   Our approach is a contribution to learning analytics, for which Greller and Drachsler name two basic goals: prediction and reflection [7]. The commit data we analyzed has a coarse granularity compared to other work on programming education reviewed by Ihantola et al. [8], where the level of analysis is typically finer, for example, keystrokes. Our initial hope was that code process metrics could have some predictive power for student courses. This, however, was not the case, despite several studies related to the quality and evolution of software in industry [9]. Nagappan et al. found that the structure of the development organization is a stronger predictor of defects than code metrics from static analysis [6], and Mulder and Zaidman identified cross-cutting concerns using software repository mining [10]. This paper is thus a parallel approach to static code analysis [2] or extensive test suites [11] for the evaluation of student assignments.
   The metrics we use stem from the work of Greiler et al. [12], D'Ambros et al. [1], and Tornhill [3], [13]. As an example, in industry the author fragmentation [5] is negatively correlated with code quality. This is supported by Greiler et al. [12], who find that the number of defects increases with the number of minor contributors to a module, and by Tufano et al. [14], who find that the risk of a defect increases with the number of developers who have worked on that part of the code. However, one can also go further and look at the commit metadata to capture design degradation, as Oliva et al. did [15]. Our approach therefore combines learning analytics with insights from industry. Since in realistic projects a developer rarely programs alone, we found that the focus of our analysis should also be on groups. This naturally limits us in drawing conclusions about the learning process of an individual student.

                            V. CONCLUSIONS

   While static code analysis has often been investigated in educational settings, code process metrics from Git commits with a focus on groups represent a novel direction. We present an approach for analyzing code process metrics based on Git commits from student assignments. However, from the interpretation of our results, we cannot identify any metric that has a significant correlation with the assignment scores achieved by the students. Does this mean that code process metrics are not useful for teaching programming? From our experience, it is quite the contrary: we assume that the two courses were not a realistic setting for profiting from good coding practices. To become good software engineers in industry, students should learn how to write maintainable code, even if their code will be discarded after the semester. To establish good practices, code process metrics should play a larger role in practical software engineering courses, and could even be part of the grading. In any case, in pure programming courses with very limited timeframes, code process metrics should not be used for the assessment of assignment solutions, since they are poor predictors of the score. Furthermore, when giving students guidance on how to work on programming assignments, we can give suggestions such as starting early, preferring many small commits over few large ones, and clearly separating tasks, but these practices do not necessarily result in a better score.
   We see time as a critical factor for the significance of code process metrics. Future work could therefore analyze development efforts with varying time frames to investigate our argument. Our paper is a first attempt at utilizing code process metrics in programming education, shaped by the characteristics of the courses we considered. This means there is still potential in this topic, and more research covering different contexts, especially larger student projects, is desirable.

                              REFERENCES

 [1] M. D'Ambros, H. Gall, M. Lanza, and M. Pinzger, Analysing Software Repositories to Understand Software Evolution. Berlin, Heidelberg: Springer, 2008, pp. 37–67.
 [2] L. W. Dietz, J. Manner, S. Harrer, and J. Lenhard, “Teaching clean code,” in Proceedings of the 1st Workshop on Innovative Software Engineering Education, Ulm, Germany, Mar. 2018.
 [3] A. Tornhill, Software Design X-Rays. Pragmatic Bookshelf, 2018.
 [4] S. Harrer, J. Lenhard, and L. Dietz, Java by Comparison: Become a Java Craftsman in 70 Examples. Pragmatic Bookshelf, Mar. 2018.
 [5] M. D'Ambros, M. Lanza, and H. Gall, “Fractal figures: Visualizing development effort for CVS entities,” in 3rd IEEE International Workshop on Visualizing Software for Understanding and Analysis. IEEE, Sep. 2005, pp. 1–6.
 [6] N. Nagappan, B. Murphy, and V. Basili, “The influence of organizational structure on software quality: An empirical case study,” in Proceedings of the 30th International Conference on Software Engineering, ser. ICSE ’08. New York, NY, USA: ACM, 2008, pp. 521–530.
 [7] W. Greller and H. Drachsler, “Translating learning into numbers: A generic framework for learning analytics,” Journal of Educational Technology & Society, vol. 15, no. 3, pp. 42–57, 2012.
 [8] P. Ihantola, K. Rivers, M. Á. Rubio, J. Sheard, B. Skupas, J. Spacco, C. Szabo, D. Toll, A. Vihavainen, A. Ahadi, M. Butler, J. Börstler, S. H. Edwards, E. Isohanni, A. Korhonen, and A. Petersen, “Educational data mining and learning analytics in programming,” in Proceedings of the 2015 ITiCSE on Working Group Reports. New York, NY, USA: ACM, 2015, pp. 41–63.
 [9] M. D. Penta, “Empirical studies on software evolution: Should we (try to) claim causation?” in Proceedings of the Joint ERCIM Workshop on Software Evolution and International Workshop on Principles of Software Evolution. New York, NY, USA: ACM, 2010, pp. 2–2.
[10] F. Mulder and A. Zaidman, “Identifying cross-cutting concerns using software repository mining,” in Proceedings of the Joint ERCIM Workshop on Software Evolution and International Workshop on Principles of Software Evolution. New York, NY, USA: ACM, 2010, pp. 23–32.
[11] V. Pieterse, “Automated assessment of programming assignments,” in Proceedings of the 3rd Computer Science Education Research Conference on Computer Science Education Research, ser. CSERC ’13. Heerlen, The Netherlands: Open Universiteit, 2013, pp. 45–56.
[12] M. Greiler, K. Herzig, and J. Czerwonka, “Code ownership and software quality: A replication study,” in IEEE/ACM 12th Working Conference on Mining Software Repositories. IEEE, May 2015, pp. 2–12.
[13] A. Tornhill, Your Code As a Crime Scene. Pragmatic Bookshelf, 2016.
[14] M. Tufano, G. Bavota, D. Poshyvanyk, M. D. Penta, R. Oliveto, and A. D. Lucia, “An empirical study on developer-related factors characterizing fix-inducing commits,” Journal of Software: Evolution and Process, vol. 29, no. 1, Jun. 2016.
[15] G. A. Oliva, I. Steinmacher, I. Wiese, and M. A. Gerosa, “What can commit metadata tell us about design degradation?” in Proceedings of the 2013 International Workshop on Principles of Software Evolution, ser. IWPSE 2013. New York, NY, USA: ACM, 2013, pp. 18–27.

