Stager: Simplifying the Manual Assessment of Programming Exercises

Christopher Laß, Stephan Krusche, Nadine von Frankenberg, Bernd Brügge
Technische Universität München
christopher.lass@tum.de, krusche@in.tum.de, nadine.frankenberg@in.tum.de, bruegge@in.tum.de

Abstract

Assessing programming exercises requires time and effort from instructors, especially in large courses with many students. Automated assessment systems reduce the effort, but impose a certain solution through test cases. This can limit the creativity of students and lead to a reduced learning experience. To verify code quality or evaluate creative programming tasks, the manual review of code submissions is necessary. However, the process of downloading the students' code, identifying their contributions, and assessing their solution can require many repetitive manual steps.

In this paper, we present Stager, a tool designed to support code reviewers by reducing the time to prepare and conduct manual assessments. Stager downloads multiple submissions and adds the student's name to the corresponding folder and project, so that reviewers can better distinguish between different submissions. It filters out late submissions and applies coding style standards to prevent white space related issues. Stager combines all changes of one student into a single commit, so that reviewers can identify the student's solution more quickly.

Stager is an open source, programming language agnostic tool with an automated build pipeline for cross-platform executables. It can be used for a variety of computer science courses. We used Stager in a software engineering undergraduate course with 1600 students and 45 teaching assistants in three separate programming exercises. We found that Stager improves the code correction experience and reduces the overall assessment effort.

1 Introduction

The number of students in university courses is increasing. The number of new undergraduate students at our computer science department increased by 81 % between 2013 (1110 students) and 2017 (2005 students) (https://www.tum.de/die-tum/die-universitaet/die-tum-in-zahlen/studium). Practical programming exercises are essential in computer science education and help students acquire important skills in software development [Staubitz et al., 2015]. However, the manual assessment of programming exercises in large courses can take a considerable amount of time and effort. Automated assessment systems (also called auto-graders) aim at flexibility and scalability in large courses and allow instructors to integrate exercises into lectures [Krusche et al., 2017b]. These systems utilize, among others, version control systems (VCS) to store the code solutions of students in repositories, and test cases that are executed on a continuous integration server to assess the solution to a programming exercise automatically [Heckman and King, 2018; Krusche and Seitz, 2018].

While automated assessment systems significantly reduce manual assessment effort, they have drawbacks. Predefined test cases cannot cover all possible solutions and therefore impose a certain solution on the students. Some students are limited in their programming skills, while others can exploit the test cases by repetitive trial-and-error submissions. Especially first-year students who are new to programming often experience problems when trying to formulate their solution and thoughts as an executable computer program [Robins et al., 2003]. Such submissions can be overly complicated, and assessment systems cannot (yet) provide enough useful feedback in that regard. Furthermore, some programming exercises cannot be assessed automatically. The automated grading of creative assignments with open problem statements is hardly possible because different solutions exist [Knobelsdorf and Romeike, 2008; Krusche et al., 2017a]. An example for such an assignment is to implement a creative collision strategy in a 2D racing game.
Automated test cases might be able to validate a collision, but are incapable of assessing the creativity or code quality of the solution. As a result, manual assessment can be beneficial, even in large courses that have fully implemented automated grading solutions.

However, the process of manually assessing multiple students' solutions requires repeated manual steps. Tasks such as finding the next student's repository, downloading the source code, and renaming folders and projects for standardization can be time-consuming and error-prone. Determining a student's contribution is challenging when the exercise builds upon a provided code template and when the students use multiple commits in their code repository. Then it becomes difficult to separate the provided template from the final solution.

In this paper, we present Stager, a tool that is designed to support the manual assessment of programming exercises. Reviewers, e.g. teaching assistants or instructors, can automate the manual steps that are necessary to prepare the students' code repositories, for instance downloading all repositories at once, and thereby reduce the manual assessment time. The idea for Stager evolved during an undergraduate university course with 1600 students and 45 teaching assistants. An initial implementation was used for three separate programming assignments.

The remainder of the paper is organized as follows. We describe related work focusing on existing automated assessment solutions and the limitations of automated assessment approaches in Section 2. In Section 3, we cover Stager's approach to automating the recurring manual steps during the correction of programming exercises.
We describe design decisions, the exercise workflow with Stager, the configuration possibilities of the tool, and the concrete tasks of Stager, e.g. the Download repositories task. We analyze the improved code assessment experience of the teaching assistants by means of an experience report in Section 4, where we also present the results of a quantitative analysis of Stager's use in three programming exercises. Section 5 concludes the paper and provides directions for future work.

2 Related Work

Several automated assessment system approaches for programming assignments exist [Heckman and King, 2018; Knobelsdorf and Romeike, 2008; Krusche and Seitz, 2018; Pieterse, 2013]. Advantages include a decrease in the workload of course instructors and timely feedback for students [Pieterse, 2013]. Automated systems work well to grade programming assignments consistently and evaluate specific aspects, e.g. the functionality [McCracken et al., 2001] or efficiency of a system [Jackson and Usher, 1997]. However, they miss the benefit of personal feedback which a manual grading approach could provide. The test cases used by such systems cannot assess the code quality and "elegance" of the solution [Poženel et al., 2015].

Building a robust automated assessment system amounts to a heavy workload, whereby the definition of the test cases is (usually) the most time consuming activity [Cerioli and Cinelli, 2008]. This workload is amplified when designing tasks with some degree of freedom of solutions [Chen, 2004].
The degree of freedom of solutions indicates the difficulty of the exercise [Striewe and Goedicke, 2013], meaning that a difficult exercise has more possible solutions and therefore an increased workload to design the automated assessment system. Depending on the class size, it can therefore be less time consuming to manually assess solutions than to design the automated assessment system [Ala-Mutka, 2005].

Further, students can become distracted by automated feedback. For instance, students may be tempted to fix only the failing tests instead of focusing on the assignment [Heckman and King, 2018]. Automated assessment systems also circumvent the detection of frequent mistakes or misunderstandings among students, although the understanding and resolution of common errors is an essential learning experience for students. Semi-automated systems combine the mentioned aspects by providing automated grading as well as manual feedback. Such systems offer personalized feedback to some extent, for instance the instructor can annotate a static assessment [Gerdes et al., 2017]. Other systems give the student instant feedback if the student's solution is correct. If it is not, the instructor reviews each solution and can give additional feedback if required [Insa and Silva, 2015].

Many systems focus on the grading itself, but not on the process the instructor has to follow to obtain the students' solutions. Some commercially available systems and tools that are used in computer science (CS) courses offer features that aim at simplifying this process. In 2000, Jackson proposed an approach that pre-processes student submissions (sent via e-mail) by removing irrelevant information or unpacking files [Jackson, 2000]. For submissions via repositories, pull requests (also called merge requests) in GitHub (https://github.com), GitLab (https://gitlab.com), or Bitbucket (https://bitbucket.org) allow students to commit their changes into separate branches. After requesting the code to be merged into the main branch, i.e. a submission, reviewers can highlight the student's contribution as a difference to the template code and provide feedback by requesting changes. While pull requests can also be integrated with continuous integration systems, e.g. using TravisCI (https://travis-ci.org) to detect compile errors and to run automated tests, reviewers might still need to download the source code and execute it to verify that all requirements of the problem statement have been solved.

GitLab introduced a "Squash and Merge" option which "applies all of the changes in a merge request as a single commit, and then merges that commit using the merge method set for the project" (https://docs.gitlab.com/ee/user/project/merge_requests/squash_and_merge.html). This cleans up the commit history and can make it easier to identify the contribution of one particular student. Tools and services such as Gerrit (https://www.gerritcodereview.com) support code reviews that enable the reviewer to see the code difference, and provide the option to leave in-line comments. However, such tools primarily focus on continuous feedback rather than assessing a student's solution.

3 Stager's Approach

This section presents an approach that automates manual steps during the correction of programming exercises in order to prepare student repositories for easier assessment. We show how code reviewers can use Stager. Furthermore, we explain the different tasks that are automatically executed by Stager.

Figure 1 illustrates the exercise workflow including the manual assessment with the help of Stager as a UML activity diagram. As precondition for this workflow, every student must have their own repository with the code template for the exercise in a VCS (multiple tools automate this step, e.g. ArTEMiS or GitHub Classroom). After the students complete the exercise, they commit and push their solutions to the VCS (action 1.3). Before reviewers start to work, they need to configure Stager (action 2.1). Then, they trigger Stager to process different tasks (actions 3.1 ... 3.6), such as Download repositories or Normalize code style. Finally, the reviewer can manually assess the pre-processed submissions and give qualitative feedback (actions 4.2 and 5.) to the students in any arbitrary form (e.g. uploading the feedback into an exercise management system such as Moodle, https://moodle.org).

Figure 1: Exercise workflow with Stager: students complete the exercise and upload their solutions to a VCS; the reviewer configures and triggers Stager and afterwards manually assesses the prepared repositories and gives qualitative feedback to the students. The activity diagram contains three swimlanes: Student (1.1 Receive exercise and code template, 1.2 Solve exercise, 1.3 Commit and push solution, 5. Receive qualitative feedback), Reviewer (2.1 Configure Stager, 2.2 Trigger Stager, 4.1 Manually assess prepared repositories, 4.2 Give qualitative feedback), and Stager (3.1 Download repositories, 3.2 Rename folders, 3.3 Filter late submissions, 3.4 Rename projects, 3.5 Normalize code style, 3.6 Combine commits).

The action 2.1 Configure Stager of the reviewer is described in Section 3.1. Stager's actions are described as tasks in Section 3.2. The numbering in Section 3.2 aligns with the corresponding actions in Figure 1.

3.1 Stager's Setup

Stager is free, open source, and available under the MIT license (https://github.com/arubacao/stager). It is platform independent and programming language agnostic, making Stager universally applicable. It is written in the Go programming language (https://golang.org) and makes use of the distributed version control system git (https://git-scm.com). Cross-platform executables can be downloaded from the automatic build pipeline or compiled from the source code.

Stager's configuration is separated into two files, students.csv and config.json, based on how frequently the settings change. The list of students in students.csv might not change during the course duration, while config.json changes for every exercise. The configuration procedure must be completed after the code template is finished and before Stager is executed. Neither Stager nor its configuration adds any preconditions or constraints on the students. The following settings can be edited:

1. Credentials: Remote git repositories can be accessed via the SSH or HTTP protocols [Lawrance et al., 2013]. For HTTP, the JSON keys username and password have to be set with valid credentials and access rights to the VCS.
For SSH, Stager uses the operating system's global SSH settings and therefore does not require further configuration.

2. Latest commit hash of a programming exercise template: The programming exercises that are distributed to the students build upon a given code template. The SHA hash of the latest commit of the code template, meaning the latest code changes the reviewer included, must be set for the JSON key squash_after. This setting is required for Stager to distinguish between the code given by the reviewer and the code written by the student. This configuration option is used by the task Combine commits and is further elaborated in Section 3.2.

3. Deadline for homework submission: Students have to submit their homework in a given time-frame. For example, the homework must be submitted by Sunday midnight because the programming exercises will be discussed in class on Monday morning. However, VCSs have limitations when it comes to time-based repository access. As described in more detail in Section 3.2, the task Filter late submissions makes it possible to overcome these limitations. The deadline for students submitting their homework is set with the JSON key deadline. The standard datetime format YYYY-MM-DD HH:MM:SS must be used. For example, 2018-08-31 23:59:59 is valid.

4. Remote repository URL schema: Each student has a personal repository that can be accessed with a unique URL. A general URL schema can be derived from these unique URLs, where the students' identifiers are substituted by a placeholder. For example, for the repository URL (1) of student 10001, the derived general URL schema is (2). If the repositories are accessed using HTTP as in the example, two additional placeholders must be set for the reviewer's credentials (3). The resulting schema is set for the key url.

https://repo.uni/cs101/exercise01-10001.git (1)
https://repo.uni/cs101/exercise01-%s.git (2)
https://%s:%s@repo.uni/cs101/exercise01-%s.git (3)
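Taken together, a config.json for this example course could look as follows. This is a sketch based on the JSON keys described above; the commit hash and the credentials are placeholder values, and the exact file layout may differ in the current Stager version:

{
  "username": "reviewer",
  "password": "secret",
  "squash_after": "1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b",
  "deadline": "2018-08-31 23:59:59",
  "url": "https://%s:%s@repo.uni/cs101/exercise01-%s.git"
}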
5. List of students: In addition to the mentioned settings, Stager requires a list of the students the reviewer wants to assess. The students' names and identifiers are defined in the students.csv file with the format shown in Listing 1. All people and courses mentioned in this paper are placeholder names and do not exist in reality.

Listing 1: Sample students.csv

name,id
John Doe,10001
Jane Roe,10002

After configuration, the Stager executable, config.json, and students.csv are placed in a dedicated and empty folder. Stager can then be executed via a double click or from the terminal. Listing 2 illustrates this workflow. After Stager terminates, the students' repositories are locally available and prepared by the tasks described in the following Section 3.2.

Listing 2: Folder setup and execution of Stager

$ cd ~/cs101/assessment3
$ ls
config.json stager students.csv
$ ./stager

3.2 Stager's Tasks

Stager provides an extendable framework which makes it easy to add or remove tasks according to the reviewer's requirements. Tasks are functions that modify the repository or its contents and have a single purpose. For example, the Rename folders task appends the student's name to the corresponding folder. Stager is composed of multiple tasks (shown in the Stager swimlane in Figure 1) that adhere to certain rules and are performed sequentially during the tool's execution. The implementation allows a clear distinction of tasks, such that each task addresses a separate purpose. Therefore, it is easy to add new tasks or remove existing ones, conceptually and implementation-wise, in the future. For example, when the reviewer does not need a certain task, only one line of code within the array of tasks has to be removed. Furthermore, tasks must be idempotent, meaning that multiple executions of the task lead to the same output. Even though tasks are independent, they are processed sequentially, i.e. the order of the tasks is relevant. For instance, repositories first have to be downloaded before other tasks have local file access.

The goal of Stager is to simplify the manual assessment of programming exercises by modifying source code, files, and repositories. Repetitive manual steps that are required for the reviewer to start the assessment should be reduced or eliminated by Stager. We identified the following relevant tasks, listed according to the order of execution; a sketch of the resulting task pipeline follows the list, and each task is described in detail afterwards:

1. Download repositories
2. Filter late submissions
3. Rename folders
4. Rename projects
5. Normalize code style
6. Combine commits
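To illustrate the framework, the following Go sketch shows how such a pipeline of single-purpose, idempotent tasks could be structured. The type and function names are illustrative and the signatures are simplified (the real tasks also receive the configuration); this is not Stager's actual source code:

package main

import "log"

// A task is a function with a single purpose that modifies a locally
// cloned student repository. Tasks must be idempotent: executing them
// multiple times leads to the same output.
type task func(repoPath string) error

// Placeholder implementations of the six tasks described below.
func downloadRepository(repoPath string) error    { return nil }
func filterLateSubmissions(repoPath string) error { return nil }
func renameFolder(repoPath string) error          { return nil }
func renameProject(repoPath string) error         { return nil }
func normalizeCodeStyle(repoPath string) error    { return nil }
func combineCommits(repoPath string) error        { return nil }

func main() {
    // The tasks are processed sequentially in this order. Removing one
    // line from the slice removes the corresponding step.
    pipeline := []task{
        downloadRepository,
        filterLateSubmissions,
        renameFolder,
        renameProject,
        normalizeCodeStyle,
        combineCommits,
    }
    repo := "./exercise01-10001" // in practice, one clone per student
    for _, t := range pipeline {
        if err := t(repo); err != nil {
            log.Fatalf("task failed for %s: %v", repo, err)
        }
    }
}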
1. Download repositories: In order to better determine the software quality and verify that all requirements of the problem statement have been solved by the students' submissions, it is necessary for the reviewer to compile and execute their homework source code locally. Hence, the repositories must be available on the reviewer's computer. The initial task clones all repositories of the predefined students, as-is and all at once, to a given folder on the reviewer's computer. This first task takes potentially existing local repositories into account and overwrites them. It ensures that each local repository is in sync with the remote repository and in a clean state. The following tasks modify files and therefore require write access to the repositories. These modifications can only be performed when the repositories are locally available. Consequently, the Download repositories task must run first.

2. Filter late submissions: Homework submissions are tied to a hard deadline. With web-based VCSs like Bitbucket or GitLab, it is hardly possible to block student commits after a given deadline. Students could exploit this situation and extend their time to finish the exercise, as shown in Figure 2. The Filter late submissions task analyzes the commit timestamps and sets the repository to its state at the deadline pre-configured in config.json. Commits after the deadline are no longer considered. This way, the time-based limitations of web-based VCSs are bypassed. However, this procedure is not fully forgery-proof, since commit timestamps can be manipulated. File changes made by tasks prior to this one would be stripped out, since the repository is reset to its state at the pre-configured deadline. Therefore, the Filter late submissions task must be executed before any other task modifies files.

Figure 2: Filter late homework submissions by excluding commits after the homework submission deadline. The two commits above the red line are after the deadline, while the two commits below the red line are before the deadline.
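Using git, the core of this task can be expressed in two commands: find the newest commit before the deadline and reset the repository to it. The following Go sketch illustrates the idea; it is a minimal sketch assuming git is installed, and Stager's actual implementation may differ:

package main

import (
    "fmt"
    "os/exec"
    "strings"
)

// filterLateSubmissions resets a repository to the last commit made
// before the configured deadline, discarding all later commits.
// Note: commit timestamps can be forged, as discussed above.
func filterLateSubmissions(repoPath, deadline string) error {
    // Find the newest commit with a commit date before the deadline.
    out, err := exec.Command("git", "-C", repoPath,
        "rev-list", "-1", "--before="+deadline, "HEAD").Output()
    if err != nil {
        return err
    }
    hash := strings.TrimSpace(string(out))
    if hash == "" {
        return fmt.Errorf("no commit before deadline in %s", repoPath)
    }
    // Set HEAD and the working tree to the state at the deadline.
    return exec.Command("git", "-C", repoPath, "reset", "--hard", hash).Run()
}

func main() {
    if err := filterLateSubmissions("./exercise01-10001", "2018-08-31 23:59:59"); err != nil {
        fmt.Println(err)
    }
}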
3. Rename folders: Depending on the naming convention, only the student's identifier is used for the repository name. The resulting folders can be hard to keep apart and to associate with the correct student. For obfuscation and identity protection this is reasonable, but it is counterproductive on the reviewer's local system, since it is easier to identify a student by their name than by their id. Once the repositories are locally available, the Rename folders task appends the student's name to the corresponding folder, as illustrated in Figure 3.

Figure 3: Append names to folders to better distinguish between students. Without the names john_doe and jane_roe, it would be difficult to identify which folder belongs to which student.

4. Rename projects: As a precondition of Stager, each student must have their own repository for each published exercise. The content of these repositories is initially identical. As a result, the project names are also identical for all students. This leads to the problem that reviewers could only import one project at a time into Eclipse in order to review and execute the code. Renaming all projects manually is time-consuming and error-prone. Analogous to the Rename folders task, a student's name is prepended to the corresponding project name. This makes it possible to distinguish between students within source code editors or integrated development environments (IDEs), e.g. Eclipse (https://www.eclipse.org), and allows multiple projects to be imported at the same time (Figure 4). Eclipse, for instance, does not allow importing multiple projects with identical names, which makes it impossible to compare multiple solutions without renaming the projects.

Figure 4: Prepend student names to projects so that the submissions of multiple students can be imported into Eclipse and reviewed at the same time. Jane Roe and John Doe are prepended to the project names. Otherwise, the reviewer could only import one Eclipse project at a time.

5. Normalize code style: The encoding and code style of the provided code template and the final student's contribution should be consistent. Windows and Unix-based systems use different line breaks for code files by default: Windows uses carriage return and line feed "\r\n" as a line ending, whereas Unix-based systems use just line feed "\n". Also, IDEs might automatically enforce a different code style standard than desired. As illustrated in Figure 5, this can lead to non-relevant changes and obscured code differences in commits, thereby making it harder to assess the submission. To avoid these non-relevant file changes by the student, Stager invokes a linter that automatically normalizes the code to the same standards as the initial template. This means that all white space related changes, e.g. line breaks, empty spaces, and tabs, are removed, so that the reviewer does not need to analyze them. Each programming language has its own linting strategies, utilizing existing tools like eslint (https://github.com/eslint/eslint) for JavaScript or checkstyle (https://github.com/checkstyle/checkstyle) for Java. This hides pure white space and encoding changes and allows code reviewers to focus on the actual contributions by the students.

Figure 5: There is no visual change in the two code blocks in this figure. However, non-visible line breaks cause the comparison tool to show these lines. This can make it time-consuming for the reviewer to identify relevant changes.
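The simplest of these normalizations, converting Windows line endings to Unix line endings, could look as follows in Go. This is an illustrative sketch restricted to Java files; as described above, Stager delegates language-specific style normalization to existing linters such as checkstyle:

package main

import (
    "bytes"
    "log"
    "os"
    "path/filepath"
)

// normalizeLineEndings rewrites all Java source files in a repository
// to Unix line endings so that pure white space differences disappear
// from the diff against the template.
func normalizeLineEndings(repoPath string) error {
    return filepath.Walk(repoPath, func(path string, info os.FileInfo, err error) error {
        if err != nil || info.IsDir() || filepath.Ext(path) != ".java" {
            return err // skip directories and non-Java files
        }
        data, err := os.ReadFile(path)
        if err != nil {
            return err
        }
        // Replace Windows "\r\n" line endings with Unix "\n".
        unix := bytes.ReplaceAll(data, []byte("\r\n"), []byte("\n"))
        return os.WriteFile(path, unix, info.Mode())
    })
}

func main() {
    if err := normalizeLineEndings("./exercise01-10001"); err != nil {
        log.Fatal(err)
    }
}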
As a result, it is 7 Model Transformations and Refactorings easy for the reviewer to quickly identify the student’s 8 Pattern-Based Development contribution and to decide if the solution is correct. 9 Lifecycle Modeling In addition to the existing branches with the complete 10 Software Configuration Management commit history, Stager adds the combined commit into 11 Testing a separate branch. Thus, information is only added 12 Project Management and not removed from the repository and the reviewer 13 Repetitorium could still see the whole commit history. Web-based VCSs like GitHub also offer a squash feature, however, Table 1: The course Introduction to Software Engineer- the reviewer would have to trigger it manually for ing lasts 13 weeks. each repository. 1600 students were registered for the course in Figure 6 illustrates this process with an example 2018. One lecturer and three exercise instructors student John Doe and an Instructor. The Instructor were involved in the organization of the course. 45 provides a code template. John Doe works on the teaching assistants were responsible for holding 74 given exercise. Over a period of one day, John submits exercise group sessions per week. Teaching assistants his work separated across multiple commits. As seen were mainly bachelor students in the fourth semester, in the bottom right corner of Figure 6, one assignment who successfully completed the same course in the was to Add new car types to the game. Since John previous year. submitted multiple code changes and removed the The course design is based on interaction and as- “TODO” lines within the code, the reviewer would sumes active participation from students. The interac- have to actively scan all nine commits to identify tive parts include in-class exercises, in-class quizzes, John’s solution. Stager solves this time-consuming and exercise sessions. Students need to bring their process by combining all student commits into one laptops to the class and to exercise sessions. Stu- single commit that includes all changes by John. This dents can earn bonus points for completing in-class single commit is selected in the top of Figure 6. The and homework exercises successfully. They can use reviewer can see every file that has been modified by these bonus points to improve their final exam grade. the student and quickly identify, whether John has completed the assignment correctly. 16 The German title is “Einführung in die Softwaretechnik”. V. Thurner, O. Radfelder, K. Vosseberg (Hrsg.): SEUH 2019 39 Stager: Simplifying the Manual Assessment of Programming Exercises Christopher Laß, Stephan Krusche, Nadine von Frankenberg und Bernd Bruegge, TU München Figure 6: Student commits are combined into one discrete change set: the commit at the top highlighted in blue. This commit displays the difference between a provided code template by the instructor and the submitted solution by the student. All commits of the student John Doe are still available. For instance, if they score more than 90 % of the to- different design patterns to make the game extensible tal exercise points, their grade in the final exam is for new requirements. improved by 1.0. This possibility motivates the stu- To submit their solutions, the students commit their dents to participate in the in-class exercises and in changes to a version control system. This automati- the homework exercises. 
4 Experience Report

The following experience report describes the lecture-based course Introduction to Software Engineering (EIST, from the German title "Einführung in die Softwaretechnik") in which we used Stager to improve the manual assessment of programming exercises. EIST is a second semester bachelor's course with a heterogeneous group of students including computer science, business informatics, and business students.

The course assumes that students have successfully completed an introductory course in computer science (e.g. CS1) and are familiar with object-oriented programming in Java. The course's learning goals are that students are able to apply relevant concepts and methods in all phases of software engineering projects, including analysis, design, implementation, testing, and delivery. Further, students know the most important terms and concepts and can apply them in modeling and programming tasks. They are aware of the problems and issues that generally have to be considered in software engineering projects. Table 1 shows the schedule and the content of the course.

Week | Content
1    | Introduction
2    | Model-Based Software Engineering
3    | Requirements Elicitation and Analysis
4    | System Design I
5    | System Design II
6    | Object Design
7    | Model Transformations and Refactorings
8    | Pattern-Based Development
9    | Lifecycle Modeling
10   | Software Configuration Management
11   | Testing
12   | Project Management
13   | Repetitorium

Table 1: The course Introduction to Software Engineering lasts 13 weeks.

1600 students were registered for the course in 2018. One lecturer and three exercise instructors were involved in the organization of the course. 45 teaching assistants were responsible for holding 74 exercise group sessions per week. Teaching assistants were mainly bachelor students in the fourth semester who had successfully completed the same course in the previous year.

The course design is based on interaction and assumes active participation from students. The interactive parts include in-class exercises, in-class quizzes, and exercise sessions. Students need to bring their laptops to the class and to exercise sessions. Students can earn bonus points for completing in-class and homework exercises successfully. They can use these bonus points to improve their final exam grade. For instance, if they score more than 90 % of the total exercise points, their grade in the final exam is improved by 1.0. This possibility motivates the students to participate in the in-class exercises and in the homework exercises. In-class exercises consist of quizzes (similar to the quiz exercises described in [Krusche et al., 2017c]), modeling, and programming exercises. Homework exercises include modeling, text, and programming exercises.

4.1 Programming Exercises

Between 600 and 1200 students actively participated in each programming exercise throughout the semester, as shown in Figure 7 and Figure 8. In each exercise, the students had to write new source code or adjust existing code based on a given problem statement. All students worked on the existing template code of an exercise in their individual git repository. The exercises were based on a 2D racing game called Bumpers. In the game, cars collide with each other and each collision has a winner. The course is designed so that each week's exercises focus on a different part of Bumpers in accordance with the lecture's content, e.g. in week 8, "Pattern-Based Development", exercises include the implementation of different design patterns to make the game extensible for new requirements.

To submit their solutions, the students commit their changes to a version control system. This automatically triggers test cases on a continuous integration server to verify the given solution. After the submission of their solution, students automatically see the test results as individual feedback and can improve their solution according to this feedback.

Figure 7: Number of students who submitted solutions to homework programming exercises (H01, H02, H07, H08, H11).

Figure 8: Number of students who submitted solutions to in-class programming exercises (L02, L07, L08, L10, L11).
However, not all aspects of a problem statement can be automatically tested. Either it is difficult to test a certain aspect of a solution, for instance complex behavior tests, or the problem statement provides a high degree of freedom, which makes it difficult to write test cases, e.g. for open or visionary questions. The following three homework programming exercises required manual assessment by the teaching assistants. The second and third exercises were graded semi-automatically.

1. Collision Detection: The task was to implement a creative collision detection algorithm for cars in Bumpers. The students were given executable template code and had to extend it with a new class that included their solution. This exercise required manual correction to test whether the new collision algorithm performed as intended. Additionally, the most creative solutions were awarded and shown in class.

2. Serialization of Code: The students had to instantiate objects from two classes in Java. The main task was to serialize and deserialize these objects using JSON. An automated assessment system was used to test the input and output of the serialization. However, the students wrote their own serialization code, so their solutions varied, e.g. in the naming of the objects or methods. This required the teaching assistants to assess the implementations manually.

3. Adapter Pattern: Based on a code template, the assignment was to extend the 2D car racing game Bumpers with legacy code using the adapter pattern. The legacy code for an existing analog speedometer panel was provided separately. An automated assessment system graded the students' solutions. In addition, the teaching assistants had to verify that the speedometer panel was shown in the game user interface and displayed the velocity correctly.

4.2 Results

In order to determine how many manual steps during a homework assessment can be automated by Stager, we conducted a quantitative analysis for these three programming exercises. In the quantitative analysis we focused on:

1. Number of commits per student
2. Number of commits after the exercise deadline
3. Source code changes where only white spaces have been added or removed

Table 2 displays an overview of the number of participating students for each exercise together with submission metrics.

Metric                                                              | 1. Collision Detection | 2. Serialization of Code | 3. Adapter Pattern
Total submission count                                              | 1104                   | 657                      | 794
Total commit count                                                  | 1998                   | 3880                     | 2447
Average amount of commits per student                               | 1.81                   | 5.91                     | 3.08
Total commits after exercise deadline                               | 34                     | 8                        | 7
Total submission count with at least one white space related change | 125                    | 118                      | 183

Table 2: Quantitative analysis of submission metrics for three programming exercises of the course.

The number of commits per student varies from 1.81 to 5.91 on average. Stager's Combine commits task combines student commits into one single commit so that reviewers can immediately distinguish between the provided code template and the code submitted by the student. There are 34, 8, and 7 late submissions, respectively, for the observed exercises. Stager automatically filters commits that are contributed after the defined exercise deadline. Between 118 and 183 students submitted at least one commit in which they only changed white spaces. While reviewing the student contributions, white space related changes are visually distracting to the reviewer (see Figure 5), since these changes are not relevant to the exercise.

In informal discussions, seven teaching assistants reported that Stager reduced their reviewing effort significantly. The workflow without Stager required the teaching assistants to first filter the repositories by student, then check the commit dates and times, clone or download the code, and fix potential white space problems in order to be able to assess the actual submission. Depending on the number of exercise sessions, teaching assistants had to perform this manual workflow for up to 50 student submissions. Further, the repository names only include the students' identifiers, not names, so that mix-ups could occur when importing the solutions into an IDE.

4.3 Discussion

While using Stager, we identified four main advantages: (1) Combining commits is particularly helpful to review all changes of one student at a glance. This allows the reviewer to immediately identify whether the student has understood the problem statement and implemented a proper solution. (2) Renaming the projects simplifies the assessment and comparison of multiple solutions. The reviewer can import multiple solutions at the same time with one click into an IDE. It also increases the confidence of the reviewers that the assessment is associated with the correct student. (3) While most students respect the deadline of an exercise, some students have committed changes after the deadline. It would be possible to remove write permissions for all student git repositories at the given deadline, but this might be hard to realize. Enforcing the deadlines in Stager is easier and filters the cases where students try to circumvent the deadline. (4) Stager only depends on the use of git repositories for programming exercises, and other instructors can use it without adaptations in their courses, e.g. in GitHub Classroom (https://classroom.github.com) or other git environments. As Stager is open source, other instructors can adapt it to their own needs.

While Stager is easy to use as a standalone tool, reviewers need to configure it for each exercise as described in Section 3.1.
It would further simplify the configuration if Stager were integrated into the exercise management system in which the instructor sets up the programming exercise. Then Stager would automatically know the submission deadline, the latest commit of the instructor in the code template, and the remote repository URL. This would make the use of Stager easier and more seamless.

4.4 Limitations

Our experience report only included three exercises that used Stager for code reviews. It would be interesting to analyze the concrete time-savings with a comparison and to use Stager throughout the whole course. While we have first indications, we did not evaluate whether the quality of the reviews improved through the use of Stager.

In addition, Stager's implementation currently has the following limitations: (1) Reviewers have to manually search for each student repository's key the first time they use Stager, before being able to use Stager for the remaining steps. The previously mentioned integration of Stager into an exercise management system would overcome this step. (2) For every exercise, the config.json file has to be changed accordingly with the deadline, URL schema, and commit of the instructor. This could also be adapted to be automatically included when creating exercises by means of an exercise management system.
(3) Reviewers have to install Stager on their computer and start it via a double-click or the command line interface. A web-based solution or a plugin for an IDE (e.g. Eclipse) into which the reviewers import the code would provide a more user-friendly experience.

5 Conclusion

Manual code reviews are important for the learning experience of students. While automatic tests can find typical problems and check whether code works as intended, they cannot find all problems, code smells, and implementation issues. Automatic assessment imposes certain solutions on the students and might limit their creativity. Stager supports code reviewers by automating steps in the manual assessment of programming exercises to reduce the effort for the preparation and conduction of code reviews. Stager downloads multiple students' submissions, renames folders and projects, filters out late submissions, and fixes typical white space problems. All commits of one student are combined into one discrete change set that is easier to review. Code reviewers can better distinguish between the submissions of multiple students and identify students' contributions more quickly.

Our experience in a course with 1600 students and 45 teaching assistants shows that Stager reduced the reviewing effort and time for teaching assistants. The reviewers used the saved time to write better reviews and give more detailed feedback to the students. This improved the students' learning. A quantitative analysis of three programming exercises shows that Stager identifies several late submissions and fixes many white space issues.

Stager is free, open source, and available under the MIT license (https://github.com/arubacao/stager), so that other instructors can use it in their courses. We will continue the development and aim to integrate the tool into the automated assessment system ArTEMiS [Krusche and Seitz, 2018]. Our future work also includes the integration of code quality metrics to support the actual code assessment. This could make it easier for reviewers to spot code quality issues in the students' solutions and could be included, e.g. as a text file, in the feedback pipeline.

In addition, we would like to evaluate the quality of the code reviews when using Stager compared to purely manual reviews with respect to the completeness, helpfulness, and understandability of the review. Depending on the results of this evaluation, we could integrate strategies to semi-automatically propose common code review feedback. Automatic suggestions would further reduce the effort of reviewers while still allowing them to tailor these suggestions to the concrete situation.

References

[Ala-Mutka 2005] Ala-Mutka, Kirsti M.: A Survey of Automated Assessment Approaches for Programming Assignments. In: Computer Science Education 15, pages 83–102, 2005.

[Cerioli and Cinelli 2008] Cerioli, Maura; Cinelli, Pierpaolo: GRASP: Grading and Rating ASsistant Professor. In: Proceedings of the Informatics Education Europe III Conference, 2008.

[Chen 2004] Chen, P. M.: An automated feedback system for computer organization projects. In: IEEE Transactions on Education 47, pages 232–240, 2004.

[Gerdes et al. 2017] Gerdes, Alex; Heeren, Bastiaan; Jeuring, Johan; van Binsbergen, L. T.: Ask-Elle: an Adaptable Programming Tutor for Haskell Giving Automated Feedback. In: International Journal of Artificial Intelligence in Education 27, pages 65–100, 2017.

[Heckman and King 2018] Heckman, Sarah; King, Jason: Developing Software Engineering Skills Using Real Tools for Automated Grading. In: Proceedings of the 49th ACM Technical Symposium on Computer Science Education, pages 794–799, 2018.

[Insa and Silva 2015] Insa, David; Silva, Josep: Semi-Automatic Assessment of Unrestrained Java Code: A Library, a DSL, and a Workbench to Assess Exams and Exercises.
In: Proceedings of the Conference on Innovation and Technology in Computer Science Education, pages 39–44, 2015.

[Jackson 2000] Jackson, David: A semi-automated approach to online assessment. In: SIGCSE Bulletin 32, pages 164–167, 2000.

[Jackson and Usher 1997] Jackson, David; Usher, Michelle: Grading Student Programs Using ASSYST. In: Proceedings of the 28th Technical Symposium on Computer Science Education, pages 335–339, 1997.

[Knobelsdorf and Romeike 2008] Knobelsdorf, Maria; Romeike, Ralf: Creativity As a Pathway to Computer Science. In: Proceedings of the 13th Annual Conference on Innovation and Technology in Computer Science Education, pages 286–290, 2008.

[Krusche et al. 2017a] Krusche, Stephan; Bruegge, Bernd; Camilleri, Irina; Krinkin, Kirill; Seitz, Andreas; Wöbker, Cecil: Chaordic Learning: A Case Study. In: Proceedings of the 39th International Conference on Software Engineering: Software Engineering Education and Training Track, pages 87–96, IEEE, 2017.

[Krusche and Seitz 2018] Krusche, Stephan; Seitz, Andreas: ArTEMiS: An Automatic Assessment Management System for Interactive Learning. In: Proceedings of the 49th ACM Technical Symposium on Computer Science Education, pages 284–289, 2018.

[Krusche et al. 2017b] Krusche, Stephan; Seitz, Andreas; Börstler, Jürgen; Bruegge, Bernd: Interactive Learning: Increasing Student Participation through Shorter Exercise Cycles. In: Proceedings of the 19th Australasian Computing Education Conference, pages 17–26, 2017.

[Krusche et al. 2017c] Krusche, Stephan; von Frankenberg, Nadine; Afifi, Sami: Experiences of a Software Engineering Course based on Interactive Learning. In: Tagungsband des 15. Workshops "Software Engineering im Unterricht der Hochschulen", pages 32–40, 2017.

[Lawrance et al. 2013] Lawrance, Joseph; Jung, Seikyung; Wiseman, Charles: Git on the Cloud in the Classroom. In: Proceedings of the 44th ACM Technical Symposium on Computer Science Education, pages 639–644, 2013.

[McCracken et al. 2001] McCracken, Michael; Almstrum, Vicki; Diaz, Danny; Guzdial, Mark; Hagan, Dianne; Kolikant, Yifat Ben-David; Laxer, Cary; Thomas, Lynda; Utting, Ian; Wilusz, Tadeusz: A Multi-national, Multi-institutional Study of Assessment of Programming Skills of First-year CS Students. In: Working Group Reports on Innovation and Technology in Computer Science Education, pages 125–180, 2001.

[Pieterse 2013] Pieterse, Vreda: Automated Assessment of Programming Assignments. In: Proceedings of the 3rd Computer Science Education Research Conference, pages 45–56, 2013.

[Poženel et al. 2015] Poženel, Marko; Fürst, Luka; Mahnič, Viljan: Introduction of the automated assessment of homework assignments in a university-level programming course. In: 38th International Convention on Information and Communication Technology, Electronics and Microelectronics, pages 761–766, IEEE, 2015.

[Robins et al. 2003] Robins, Anthony; Rountree, Janet; Rountree, Nathan: Learning and teaching programming: A review and discussion. In: Computer Science Education 13, pages 137–172, 2003.

[Staubitz et al. 2015] Staubitz, Thomas; Klement, Hauke; Renz, Jan; Teusner, Ralf; Meinel, Christoph: Towards practical programming exercises and automated assessment in Massive Open Online Courses. In: Teaching, Assessment, and Learning for Engineering, pages 23–30, IEEE, 2015.

[Striewe and Goedicke 2013] Striewe, Michael; Goedicke, Michael: Analyse von Programmieraufgaben durch Softwareproduktmetriken. In: Tagungsband des 13. Workshops "Software Engineering im Unterricht der Hochschulen", pages 59–68, 2013.