=Paper=
{{Paper
|id=Vol-2015/ABP2017_paper_01
|storemode=property
|title=AuDoscore: Automatic Grading of Java or Scala Homework
|pdfUrl=https://ceur-ws.org/Vol-2015/ABP2017_paper_01.pdf
|volume=Vol-2015
|authors=Norbert Oster,Marius Kamp,Michael Philippsen
|dblpUrl=https://dblp.org/rec/conf/abp/OsterKP17
}}
==AuDoscore: Automatic Grading of Java or Scala Homework==
Norbert Oster,1 Marius Kamp,1 and Michael Philippsen1
Abstract: Fully automated test-based grading is crucial to cope with large numbers of student
homework code. AuDoscore extends JUnit and keeps the task of creating exercises and corresponding
grading tests simple. Students have a set of public smoke tests available. Grading also uses additional
secret tests that check the submission more intensely. AuDoscore ensures that submissions cannot
call libraries that the lecturer explicitly forbids. Grading avoids the problem of
consecutive faults by partially replacing student code with cleanroom code provided by the lecturer.
AuDoscore can be run as a stand-alone application or integrated into our Exercise Submission Tool
(EST). This paper briefly describes how both tools interact, depicts AuDoscore from the point of view
of the lecturer, and describes some key technical aspects of its implementation.
Keywords: grading, student code submission, Java, Scala
1    Introduction
Up to 750 freshmen from about 23 degree programs take our annual course on “Algorithms
and Data Structures” (AuD2 for short) that also teaches them to program in Java. The course
has a workload of 300 h (10 ECTS), 25% of which are devoted to writing Java programs of
different sizes as graded weekly homework. In some semesters, the students submit up to
17,000 Java source code files. Up to 390 other students submit more than 600 Scala files as
homework for our annual course on “Parallel and Functional Programming”, PFP.2 As this
is clearly too much code to be manually checked and graded, we apply an entirely automatic
grading process for both Java and Scala. Our lightweight grading system AuDoscore works
both stand-alone and as part of our “Exercise Submission Tool” (EST), our electronic
student and exercise management platform.
Below we first describe how students use our system. Then we present the lecturer’s interface
and share some technical details about AuDoscore. Finally, we outline our experience and
discuss some related work.
2    How Students Experience AuDoscore
When submitting solutions for programming exercises, students do not directly interact
with AuDoscore. Instead, they upload their source code files using our web-based lecture
1 Friedrich-Alexander University Erlangen-Nürnberg (FAU), Programming Systems Group,
  Martensstr. 3, 91058 Erlangen, Germany, email: [norbert.oster|marius.kamp|michael.philippsen]@fau.de
2 https://www2.cs.fau.de/aud and https://www2.cs.fau.de/pfp
management system called “Exercise Submission Tool” (EST), which then passes their files
to AuDoscore. AuDoscore is an extension of JUnit and basically uses two test sets: a public
and a secret test set. The public tests are provided as a smoke test to ensure that the student
solutions adhere to the expected interfaces. Both the public and the secret tests contribute to
the grading, but the secret tests usually check the submission in more detail.
In conjunction with EST, the AuDoscore process comprises several stages: Students first
submit their solutions (one or more Java or Scala source code files) via the web-based EST.
Immediately after having uploaded a new submission, EST shows that work is in progress
and forwards the source files to AuDoscore, which then performs the following four
stages before providing the respective feedback to the student.
Stage 0: The source code is compiled together with additional interfaces or helper classes
provided to the students, but without any test cases yet. If this stage fails, running and
grading the submission is impossible. Therefore, the student is immediately shown a skull
symbol and warned that this submission is worth 0 points.
Stage 1 checks if the submission compiles against the public test set and does not use
forbidden API functionality (see Section 4 for details). If at least one of these checks fails, the
student sees an error symbol, indicating that the submission is worth 0 points.
Stage 2 executes the public test set. An exclamation mark indicates that the submission
fails on at least one public test case. Otherwise a success symbol concludes the preliminary feedback.
Stage 3 is the real grading step that executes the submission with both the public and secret
test sets. The latter tests are run against the original student source code as well as code
modified with replacements in order to reduce subsequent faults (see Section 4 for details).
Students can re-submit modified and improved versions as often as they like before the end
of the submission deadline. A lecturer has to confirm the grading result before students can
see their test results and the graded points on their EST web-page.
The feedback in stages 0–2 and the option to re-submit mimic a continuous integration workflow.
It provides (partial) feedback as early as possible and even before the deadline, enabling
the students to continuously revise and learn. However, as AuDoscore must run all the
above stages upon each individual submission, the system load on the computers running
AuDoscore is high. We therefore use several distributed instances of DOMjudge.3
3    Lecturer’s User Interface
From the point of view of a lecturer, AuDoscore is just a set of additional annotations on top of
the JUnit framework. In order to apply automatic grading on a practical exercise, the lecturer
has to write a cleanroom solution and two JUnit test classes. For activation (Listing 1),
3 https://www.domjudge.org/
both the public and the secret test classes must declare a @Rule that tells AuDoscore how
to collect the result of each individual test case (test method). The mandatory @ClassRule
creates a summary of the grading results and the feedback to the student.
                                     List. 1: Rules used to activate AuDoscore.
@Rule
public final PointsLogger pl = new PointsLogger();

@ClassRule
public final static PointsSummary ps = new PointsSummary();
A secret test class is simply marked with a @SecretClass annotation. The public test class
requires an @Exercises annotation, which holds a nested array of @Ex annotations, i.e., a
tuple of an arbitrary but unique name and its grading points. Listing 2 shows an example.
                                 List. 2: Declaration of tasks and grading points.
@Exercises({@Ex(exID = "T1a", points = 5), @Ex(exID = "T1b", points = 7),
            @Ex(exID = "T2", points = 11)})
Each test case (test method) must be annotated with the JUnit annotation @Test and an
annotation @Points with bonus or malus values. To ensure that faults in student submissions
do not exceed given runtime limits (e.g., by means of infinite loops or recursion), all @Tests
must have a positive timeout (in milliseconds).
                                     List. 3: Declaration of grading test cases.
@Points(exID = "T2", bonus = 8) @Test(timeout = 500)
public void test1() { ... }

@Points(exID = "T2", malus = 6) @Test(timeout = 1000)
public void test2() { ... }
The bonus attribute in Listing 3 specifies that a student gains the given number of points if
the corresponding test succeeds. A failing bonus test does not yield any points. In contrast, a
successful test case annotated with a malus attribute has no impact on the final score whereas
a failing malus test reduces the number of points by the specified amount. Thus, malus
tests allow a lecturer to cancel the effects of a passed bonus test. We outline the relevance
of malus tests in Section 5 where we describe how we use such tests to detect possible
cheating.
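Putting Listings 1–3 together, a complete public test class has roughly the following shape. This is only an illustrative sketch: the class name and the test bodies are made up, and the AuDoscore annotations (@Exercises, @Ex, @Points) as well as the helper classes PointsLogger and PointsSummary are assumed to be on the classpath.

import org.junit.ClassRule;
import org.junit.Rule;
import org.junit.Test;

@Exercises({@Ex(exID = "T2", points = 11)})
public class SortingPublicTest {
    @Rule
    public final PointsLogger pl = new PointsLogger();
    @ClassRule
    public final static PointsSummary ps = new PointsSummary();

    @Points(exID = "T2", bonus = 8)
    @Test(timeout = 500)
    public void sortsSmallArray() { /* assertions on the student's code */ }

    @Points(exID = "T2", malus = 6)
    @Test(timeout = 1000)
    public void handlesEmptyArray() { /* assertions on the student's code */ }
}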
To facilitate the addition or removal of test cases, the sum of the bonus values does not have
to match the maximal number of points specified in the @Ex annotation. Instead, AuDoscore
computes a relative score for a task τ as follows:
\[
  \mathrm{grade}_\tau \;=\; \Bigl\lfloor \max\Bigl(0,\; \frac{\sum^{passed}_{TC_\tau} |bonus| \;-\; \sum^{failed}_{TC_\tau} |malus|}{\sum_{TC_\tau} |bonus|}\Bigr) \cdot |Ex_\tau.points| \Bigr\rfloor_{0.5}
\]
where $TC_\tau$ denotes the test cases declared for task $\tau$ and $\lfloor\cdot\rfloor_{0.5}$ rounds down to the nearest multiple of 0.5 points.
Example: Let us assume we have five test cases for τ = "T2": four with a bonus of 4, 8, 10, and 12,
respectively, and one with a malus of 6. As declared in Listing 2, T2 is worth a total of 11 points.
If a submission fails on the first (bonus = 4) and the last (malus = 6) test cases, then it is graded
$\lfloor \max(0, \frac{(8+10+12)-6}{4+8+10+12}) \cdot 11 \rfloor_{0.5} = 7.5$ points. If a lecturer adds another test case
without changing the total number of points, (s)he does not have to adjust the existing tests
because the bonus and malus values are normalized with respect to the total number of points.
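To make the computation concrete, here is a minimal sketch of the formula in plain Java; the method and parameter names are ours for illustration, not taken from AuDoscore.

// Sketch of the grading formula; rounds down to the nearest half point.
static double gradeTask(int[] passedBonus, int[] failedMalus, int[] allBonus, int maxPoints) {
    int earned = java.util.stream.IntStream.of(passedBonus).sum()
               - java.util.stream.IntStream.of(failedMalus).sum();
    int total = java.util.stream.IntStream.of(allBonus).sum();
    double ratio = Math.max(0.0, (double) earned / total);
    return Math.floor(ratio * maxPoints * 2) / 2.0;   // floor to multiples of 0.5
}

// Example from above:
// gradeTask(new int[]{8, 10, 12}, new int[]{6}, new int[]{4, 8, 10, 12}, 11) == 7.5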
In order to learn and practice basic algorithms and data structures, students are typically
requested to implement for instance sorting procedures or fundamental data structures
like linked lists or heaps. As the Java API comes with several pre-defined classes and
methods that would undermine the intention of the homework, lecturers can limit the parts
of the Java/Scala API that students may use. To do so, lecturers use the @Forbidden and
@NotForbidden annotations. Both accept an array of strings, each one denoting a (possibly
wild-carded part of the) fully qualified name of packages, classes, methods, or fields. The
example in Listing 4 is taken from a homework in which students had to implement a hash
table. The annotations disallow the use of any class from java.util and restrict the usable
methods from java.lang.Math to round and pow only. If a student uses at least one of the
forbidden API features, AuDoscore issues a notification in Stage 1 (see Section 2).
                            List. 4: Restriction of available Java API functionality.
@Forbidden({"java.util", "java.lang.Math"})
@NotForbidden({"java.lang.Math.round", "java.lang.Math.pow"})
Good object-oriented programmers break down the functionality of their applications into
several loosely coupled but highly cohesive methods. In order to train this skill, lecturers can
provide interfaces of the classes to be implemented, declaring and describing the different
methods and their cooperation. Consider an exercise requiring a QuickSort implementation
with the methods sort, partition, choosePivot, and swap. If a student implements a faulty
swap and calls this method from a correct partition, individual grading test cases for both
swap and partition fail due to the common cause. In order to cope with those cascading
consecutive failures without punishing students twice, AuDoscore provides the @Replace
annotation as shown in Listing 5 for the QuickSort example.
                         List. 5: Replacement of student code with cleanroom code.
@Replace({"QuickSort.swap", "QuickSort.choosePivot"})
Test cases annotated this way are first executed with the original code as submitted by
the student. Then these tests are run again, but beforehand the methods listed in the
@Replace annotation are exchanged with the cleanroom solution implemented by the lecturer (see
Section 4 for technical details). The final grade is the maximum of the points achieved in
both runs; this way, an incompatible cleanroom replacement never lowers the result that the
unmodified submission achieves. In the example above, a test case for
partition annotated with a @Replace as shown in Listing 5 no longer fails due to a faulty
implementation of swap. In contrast, a faulty partition method still causes a failure of this
test case.
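For illustration, a secret test case for partition that uses the replacement from Listing 5 might look as follows. This is a sketch only: the signature of QuickSort.partition and the assertion details are assumptions, and assertTrue is statically imported from org.junit.Assert.

@Replace({"QuickSort.swap", "QuickSort.choosePivot"})
@Points(exID = "T2", bonus = 4)
@Test(timeout = 1000)
public void partitionSplitsAroundPivot() {
    int[] data = {5, 1, 4, 2, 3};
    // assumed signature: partition(array, lowerBound, upperBound) returns the pivot position
    int pivotPos = QuickSort.partition(data, 0, data.length - 1);
    for (int i = 0; i < pivotPos; i++) assertTrue(data[i] <= data[pivotPos]);
    for (int i = pivotPos + 1; i < data.length; i++) assertTrue(data[i] >= data[pivotPos]);
}

With the cleanroom swap and choosePivot substituted in the second run, this test only fails if partition itself is faulty.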
4    Technical Design of AuDoscore
This section explains some technical details of the implementation available on GitHub.4 One
of the core components of our grading system is the set of annotations sketched in Section 3.
Since those annotations are merely meant to provide information at runtime, the main
functionality is implemented in the PointsLogger and PointsSummary classes. PointsLogger
subclasses the JUnit API class TestWatcher and is used to intercept the test set execution.
This way, JUnit notifies AuDoscore about skipping, starting, and completing individual
test cases—including the verdict of each test run. For each test execution, PointsLogger
extracts the task name and test case weight from the @Points annotation and stores them
together with the verdict and execution time in a ReportEntry. PointsSummary extends the
JUnit class ExternalResource and gives AuDoscore control over the test execution. Upon
startup, this class installs a custom SecurityManager that prevents students from interfering
with AuDoscore and from injecting unwanted code that, e.g., opens arbitrary files, executes
processes, or stops the VM. PointsSummary also replaces System.out and System.err in order
to prevent flooding the grading output and to exclude the runtime of student debugging
output from the runtime measurements. Additionally, it validates the annotations of the
public and secret test classes and methods (e.g., to enforce timeouts and @Points for all
test cases), collects the information from the @Exercises annotation and prepares for report
acquisition. After having executed all tests, JUnit notifies PointsSummary to then generate
reports for both the student and the lecturer.
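For readers unfamiliar with JUnit rules, the following much simplified sketch shows how a TestWatcher can collect verdicts. It is not the actual PointsLogger (which also records runtimes and builds ReportEntry objects) and assumes that the @Points annotation is retained at runtime with an accessor exID().

import org.junit.rules.TestWatcher;
import org.junit.runner.Description;

// Simplified verdict collector in the spirit of PointsLogger.
public class VerdictLogger extends TestWatcher {
    @Override
    protected void succeeded(Description d) { record(d, true); }

    @Override
    protected void failed(Throwable cause, Description d) { record(d, false); }

    private void record(Description d, boolean passed) {
        Points p = d.getAnnotation(Points.class);   // task id and weight of this test case
        System.err.printf("%s/%s: %s%n",
                p.exID(), d.getMethodName(), passed ? "passed" : "failed");
    }
}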
The technically most sophisticated component of AuDoscore is the support for @Replace.
The class ReplaceMixer exchanges individual methods in the student submission with their
cleanroom counterparts. To achieve this goal, we use the compiler functionality built into
the JDK to manipulate the abstract syntax tree (AST) of the student source code. Thus,
the @Replace mechanism is the only component that needs modification when porting
AuDoscore to another JVM language.
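As a rough sketch of that approach (not the actual ReplaceMixer), the JDK's javac tree API can parse a source file and visit its method declarations; replacing a method then amounts to swapping the matching subtree before emitting the merged source. The class below is our own minimal example of the API, not part of AuDoscore.

import java.io.File;
import javax.tools.JavaCompiler;
import javax.tools.StandardJavaFileManager;
import javax.tools.ToolProvider;
import com.sun.source.tree.CompilationUnitTree;
import com.sun.source.tree.MethodTree;
import com.sun.source.util.JavacTask;
import com.sun.source.util.TreeScanner;

// Lists all method declarations of a Java source file via the javac AST.
public class MethodLister {
    public static void main(String[] args) throws Exception {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        try (StandardJavaFileManager fm = compiler.getStandardFileManager(null, null, null)) {
            JavacTask task = (JavacTask) compiler.getTask(null, fm, null, null, null,
                    fm.getJavaFileObjects(new File(args[0])));
            for (CompilationUnitTree unit : task.parse()) {
                unit.accept(new TreeScanner<Void, Void>() {
                    @Override
                    public Void visitMethod(MethodTree method, Void p) {
                        System.out.println(method.getName());   // candidate for replacement
                        return super.visitMethod(method, p);
                    }
                }, null);
            }
        }
    }
}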
We steer the features @Forbidden, @NotForbidden, and @Replace from outside Java using
a simple Linux shell script. This script runs the stages described in Section 2 and checks
the restrictions of @Forbidden and @NotForbidden by searching the output of the javap
disassembler for matches of the forbidden prefixes. This way, lecturers may also forbid
individual bytecode instructions. In Stage 3, the script runs all test sets (public and secret)
with and without the required replacements and collects the results. A similar shell script
applies AuDoscore in a local stand-alone environment without EST integration, e.g., during
the development of the cleanroom and test sets or when lecturers or teaching assistants
need to inspect the results of individual student code in more depth.
4 https://github.com/FAU-Inf2/AuDoscore
Limitations. AuDoscore still has some limitations that we plan to tackle in the future.
First, it considers only test runs for grading and does not yet take into account coding
style or other non-functional aspects of the submitted code. Second, AuDoscore only
works for programs written in Java or Scala, although it may be easily extended to any
JVM-based language. Third, AuDoscore lacks specialized features to test parallel code.
Fourth, AuDoscore provides no means to integrate external libraries. Fifth, AuDoscore
cannot be configured to search for @Forbidden code only in parts of the submissions (e.g.,
individual methods or specific code lines).
5         Experience
The statistics in Table 1 summarize the productive AuDoscore use for the AuD and PFP
courses from winter term 2014/15 to 2016/17.
Table 1: Statistics on 5 semesters of AuDoscore use.

    Winter/Summer Term            WT2014/15    ST2015   WT2015/16    ST2016   WT2016/17
    AuD
      # registered students             668       198         716       172         733
      # programming tasks                37        30          38        32          34
      # Java submissions              10478      1263       12526       856       11189
      # submitted Java files          12685      1647       17274      1167       13846
      # LOC (total)                 1230394     94069     1528133     80859     1215662
      # public/secret test cases    160/562   147/737     336/810   143/635     224/804
      # post-deadline revisions          26        24          24        27          21
      # manually changed grades         102         9         723         2          13
    PFP
      # registered students               –         –          24       390          51
      # programming tasks                 –         –          12         7           7
      # Scala submissions                 –         –           1       699          21
      # submitted Scala files             –         –           1       699          21
      # LOC (total)                       –         –          14     18144         487
      # public/secret test cases          –         –       24/21     15/54       15/54
      # post-deadline revisions           –         –           0         6           0
      # manually changed grades           –         –           0        76           0
AuDoscore worked reliably and used the given number of test cases to grade the submitted
homework. Only in rare cases did we have to revise secret tests after the deadline, as the
row named “post-deadline revisions” shows. Due to the large diversity of submissions, we
sometimes did not anticipate certain kinds of misbehavior. We also made mistakes where the
homework compiled with the public test but not with the secret test. The latter typically
happened when the public test did not exercise the interface of the student submission to
the same extent as the secret test. Since the preliminary feedback shown to the student
ends with Stage 2, mistakes of this kind are immediately shown as an internal error to
the lecturer only. This way, lecturers can revise the test cases early and even during the
submission period. Table 1 also shows that in general the students considered the grading to
be fair. Only rarely could they convince a teaching assistant or lecturer to manually change
grades. Other manual interventions were necessary to cope with an AuDoscore limitation,
e.g., when something was @Forbidden in a certain segment of the students’ code only.
Advice on Test Case Construction. When students are asked to explicitly write a recursive
solution, the annotations of Section 3 provide no straightforward way to test whether they
really did so. Hence, task specifications should require that students call a specific method
in the base cases of their recursion. AuDoscore can then inspect the stack trace during test
execution to grade recursive programs. Since this feature was missing, we had to manually
check and down-grade an unusual number of AuD submissions in WT2015/16.
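One possible realization is a hook class that students must call in their base cases; a secret test then checks whether the recursive method appears more than once on the call stack. The following sketch is ours, not AuDoscore's; the names Recursion.baseCase and the method name "sort" are assumptions.

// Hypothetical hook that the task description asks students to call in every base case.
public final class Recursion {
    public static volatile boolean recursiveCallSeen = false;

    private Recursion() { }

    public static void baseCase() {
        long frames = java.util.Arrays.stream(Thread.currentThread().getStackTrace())
                .filter(f -> f.getMethodName().equals("sort"))   // the method that must recurse
                .count();
        if (frames > 1) {
            recursiveCallSeen = true;
        }
    }
}

A secret test resets the flag, calls the student's sort, and awards the points only if recursiveCallSeen is true afterwards.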
Mimicking continuous integration also opens the door to a particular kind of attack. Instead of
implementing the expected algorithm, we have seen submissions that just contain a cascade
of if-cases, each of which returns exactly the expected value visible in the public
test. To detect such submissions, the lecturer can use anti-cheat tests. For example, such
tests may call the method to be graded with many different arguments. From the diversity
of the results one can judge whether the implementation is (almost) univariate and thus
most probably cheating. These anti-cheat tests may be implemented as secret malus tests to
(partially) cancel the points granted by the public tests.
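A minimal sketch of such an anti-cheat malus test follows; the class Fibonacci, its method fib, and the thresholds are made up for illustration, and assertTrue is statically imported from org.junit.Assert.

// Secret malus test: probes the graded method with many arguments and fails
// if the results look like a hard-coded cascade of if-cases.
@Points(exID = "T1a", malus = 5)
@Test(timeout = 2000)
public void resultsAreNotHardCoded() {
    java.util.Set<Long> distinct = new java.util.HashSet<>();
    for (int n = 0; n < 40; n++) {
        distinct.add(Fibonacci.fib(n));
    }
    assertTrue("suspiciously few distinct results", distinct.size() > 30);
}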
6   Related Work
There already exist several systems for automatically grading programming assignments.
According to recent literature reviews [CAR13, Ih10], many of them serve a special purpose
(e.g., they are developed solely for a specific lecture), whereas AuDoscore and only a few other
tools are geared towards widespread adoption.
The methods for automatically grading programming assignments differ regarding how
students use them. Systems like Marmoset [Sp06] require that students install custom tools
on their workstations to submit solutions. Others (e.g., PABS [If15]) work in concert with a
version control system. In contrast, the system by Amelung et al. [AFR08] or AuDoscore
are purely web-based and thus allow students to work with any environment.
Like AuDoscore, many systems rely on testing to evaluate submissions [KJH16]. Some tools
(e.g., ProgTest [dSMB11] and Web-CAT [EP08]) even use test cases written by students
during the grading process. AuDoscore currently only supports test sets written by the
lecturer. AutoLEP [Wa11], among others, combines testing with static analysis to determine
a grade. We plan to investigate how an external static analysis tool can easily be integrated
into AuDoscore using a lightweight set of Java annotations.
Beside testing, JACK [GSB08] uses graph transformation rules on a graph generated from
the submitted source code. This allows a lecturer to specify checks that a valid solution
must pass. This approach is more powerful than the annotations provided by AuDoscore.
However, we argue that annotations are easier to write than graph transformation rules.
Graja [Ga15] extends JUnit by providing helper classes for common grading tasks. It also
uses annotations to improve the feedback shown to the students. In contrast, AuDoscore
is designed to smoothly integrate into usual JUnit testing by introducing lightweight
annotations that configure the grading system.
The recent trend to teach mobile application development in introductory courses has led
to the emergence of tools like RoboLIFT [AE12] or the system by Heimann et al. [He15]
for the automated assessment of Android applications. Furthermore, there are specialized
solutions for assessing concurrent exercises [OB07]. Some systems, like GATE [MS13],
support randomized exercises. Praktomat [KSZ02] also supports multiple variants of an
exercise and additionally allows mutual feedback among students. Currently, AuDoscore
lacks these advanced features.
7   Conclusion
This paper presents AuDoscore, a system for the fully automated test-based grading of
student homework code submissions. We described the feedback that AuDoscore presents
to students and how a lecturer can quickly enhance JUnit tests to use them for automatic
assessment. This notably reduces the time required to grade the submissions. AuDoscore
also provides features like forbidden API calls and code replacements to avoid the problem
of consecutive faults. Our experience shows that AuDoscore scales well even for courses
taken by 750 students. However, our approach seems to tempt students to try to outsmart
the system. A lecturer has to take this into account when designing tests.
Acknowledgments
We thank all contributors to AuDoscore, in particular Tobias Werth.
References
[AE12]    Allevato, Anthony; Edwards, Stephen H.: RoboLIFT: Engaging CS2 Students with
          Testable, Automatically Evaluated Android Applications. In: SIGCSE’12: Technical
          Symp. Computer Science Education. Raleigh, NC, pp. 547–552, February–March 2012.
[AFR08]   Amelung, Mario; Forbrig, Peter; Rösner, Dietmar F.: Towards Generic and Flexible Web
          Services for E-Assessment. In: ITiCSE’08: Conf. Innovation and Technology in Computer
          Science Education. Madrid, Spain, June–July 2008.
[CAR13]   Caiza, Julio C.; Álamo Ramiro, José M. del: Programming Assignments Automatic
          Grading: Review of Tools and Implementations. In: INTED’13: Intl. Technology,
          Education and Development Conf. Valencia, Spain, pp. 5691–5700, March 2013.
[dSMB11] de Souza, Draylson M.; Maldonado, José C.; Barbosa, Ellen F.: ProgTest: An Environment
         for the Submission and Evaluation of Programming Assignments based on Testing
         Activities. In: CSEE&T’11: Conf. Softw. Eng. Education and Training. Waikiki,
         Honolulu, HI, pp. 1–10, May 2011.
[EP08]     Edwards, Stephen H.; Pérez-Quiñones, Manuel A.: Web-CAT: Automatically Grading
           Programming Assignments. In: ITiCSE’08: Conf. Innovation and Technology in Computer
           Science Education. Madrid, Spain, p. 328, June–July 2008.
[Ga15]     Garmann, Robert: E-Assessment mit Graja – ein Vergleich zu Anforderungen an Soft-
           waretestwerkzeuge. In: ABP’15: Automatische Bewertung von Programmieraufgaben.
           Wolfenbüttel, Germany, November 2015.
[GSB08]    Goedicke, Michael; Striewe, Michael; Balz, Moritz: Computer Aided Assessment and
           Programming Exercises with JACK. ICB-Research Report No 28, 2008.
[He15]     Heimann, Mathis; Fries, Patrick; Herres, Britta; Oechsle, Rainer; Schmal, Christian:
           Automatische Bewertung von Android-Apps. In: ABP’15: Automatische Bewertung von
           Programmieraufgaben. Wolfenbüttel, Germany, November 2015.
[If15]     Iffländer, Lukas; Dallmann, Alexander; Beck, Philip-Daniel; Ifland, Marianus: PABS - a
           Programming Assignment Feedback System. In: ABP’15: Automatische Bewertung von
           Programmieraufgaben. Wolfenbüttel, Germany, November 2015.
[Ih10]     Ihantola, Petri; Ahoniemi, Tuukka; Karavirta, Ville; Seppälä, Otto: Review of Recent
           Systems for Automatic Assessment of Programming Assignments. In: Koli Calling’10:
           Koli Calling Intl. Conf. Computing Education Research. Koli, Finland, pp. 86–93, October
           2010.
[KJH16]    Keuning, Hieke; Jeuring, Johan; Heeren, Bastiaan: Towards a Systematic Review of
           Automated Feedback Generation for Programming Exercises. In: ITiCSE’16: Conf.
           Innovation and Technology in Computer Science Education. Arequipa, Peru, pp. 41–46,
           July 2016.
[KSZ02]    Krinke, Jens; Störzer, Maximilian; Zeller, Andreas: Web-basierte Programmierpraktika
           mit Praktomat. Softwaretechnik-Trends, 22(3), 2002.
[MS13]     Müller, Oliver; Strickroth, Sven: GATE - Ein System zur Verbesserung der Programmier-
           ausbildung und zur Unterstützung von Tutoren. In: ABP’13: Automatische Bewertung
           von Programmieraufgaben. Hannover, Germany, October 2013.
[OB07]     Oechsle, Rainer; Barzen, Kay: Checking Automatically the Output of Concurrent Threads.
           In: ITiCSE’07: Conf. Innovation and Technology in Computer Science Education. Dundee,
           Scotland, pp. 43–47, June 2007.
[Sp06]     Spacco, Jaime; Hovemeyer, David; Pugh, William; Emad, Fawzi; Hollingsworth, Jef-
           frey K.; Padua-Perez, Nelson: Experiences with Marmoset: Designing and Using an
           Advanced Submission and Testing System for Programming Courses. In: ITiCSE’06:
           Conf. Innovation and Technology in Computer Science Education. Bologna, Italy, pp.
           13–17, June 2006.
[Wa11]     Wang, Tiantian; Su, Xiaohong; Ma, Peijun; Wang, Yuying; Wang, Kuanquan: Ability-
           training-oriented automated assessment in introductory programming course. Computers
           & Education, 56(1):220–226, 2011.