=Paper=
{{Paper
|id=Vol-2066/seels2018paper02
|storemode=property
|title=Modularity for Automated Assessment: A Design-Space Exploration
|pdfUrl=https://ceur-ws.org/Vol-2066/seels2018paper02.pdf
|volume=Vol-2066
|authors=Steffen Zschaler,Sam White,Kyle Hodgetts,Martin Chapman
|dblpUrl=https://dblp.org/rec/conf/se/ZschalerWHC18
}}
==Modularity for Automated Assessment: A Design-Space Exploration==
Steffen Zschaler, Sam White, Kyle Hodgetts, Martin Chapman
Department of Informatics, King's College London, The Strand, London, UK
szschaler@acm.org

Abstract: As student numbers continue to increase, automated assessment is an inevitable element of programming education in university contexts. Modularity is a key factor in ensuring these systems are flexible, robust, secure, scalable, extensible, and maintainable. Yet, modularity has not been explicitly and systematically discussed in this field. In this paper, we first present an overview of the modularity design space for automated assessment systems and a discussion of existing systems and their place in this space. This is followed by a brief overview of our novel NEXUS platform, which uses fine-grained modularisation of graders implemented through a micro-service architecture.

I. INTRODUCTION

Automated grading has been of interest to computer-science educators for a long time. ASSYST [13] was probably the first system that tried to provide automation support for the task of assessing student submissions to programming assignments. At the time, automation was focused on giving a first assessment of a piece of software, with responsibility for the actual grading and provision of feedback still remaining firmly with the human teaching staff. As student numbers increase, we see a stronger move towards fully automated grading and feedback. Such a system has many benefits, including a reduction in time spent on marking by teaching staff, freeing them up for more productive work, and an opportunity for students to receive more feedback on incremental development stages of their software.

As the functionality and workload of automated grading systems increase, it has become evident that we need to consider sound software-engineering principles in the development of these systems. We have requirements on the reliability, security, extensibility, scalability, and maintainability of these grading systems, which cannot easily be satisfied with a simple set of shell scripts to be invoked by a teacher in response to a set of student submissions. Students are asking for web-based systems well integrated with their preferred tool infrastructure and providing high-quality, near-time feedback. Teachers are interested in using automated grading systems for a range of modules teaching programming in different languages, teaching other aspects of computer science, or even beyond computer science. System operators want automated grading systems to be robust against (intentional or unintentional) attacks by student code under assessment and to be scalable in order to manage the highly bursty workload of large classes of students working against coursework submission deadlines. Errors are inevitable in any software development, so we want to be able to partially update an existing, running automated grading system as fixes for identified problems become available. Finally, as academics, we want to be able to continuously develop our grading systems as our research into better methods for assessment and feedback evolves.

Modularity is key to satisfying all of these requirements. By breaking grading systems into suitable components and ensuring they can be exchanged and recombined flexibly and robustly, we establish the foundations for extensibility, maintainability, and reliability. By ensuring independent execution of components, we achieve scalability through appropriate replication and load-balancing techniques. Modularity also provides opportunities for encapsulation and sandboxing, which can help ensure secure execution of student code.

Some existing systems already reap some of the benefits of modularity. However, to the best of our knowledge, there is no systematic mapping of the design space for modularity in automated grading systems and of the benefits and drawbacks of different spots in this space.

In this paper, we provide a first such analysis. We begin by exploring two dimensions of modularity: what to modularise and how to modularise.
These dimensions span a design space, and we identify different positions in this space taken by different existing systems. Our preference is for a fine-granular, micro-service–based modularisation, and we briefly sketch the architecture of our NEXUS system, which implements such an architecture.

II. MODULARITY IN AUTOMATED ASSESSMENT SYSTEMS

We discuss two dimensions of the design space for modular automated-assessment systems:

1) What to modularise. Which different concerns in a grading system should be separated out into different modules? At what granularity?
2) How to modularise. What are the technology choices available for implementing components and re-composing them into a working automated-assessment system?

For each dimension, we discuss a range of choices taken by different existing systems (we tried to include as many existing systems as possible, but do not claim complete coverage of the literature) as well as the benefits and drawbacks of these choices.

A. What to modularise

In [20], Sclater and Howie describe requirements for the "ultimate" automated assessment system (AAS). In a similar vein, in Fig. 1 we sketch the logical architecture of the "maximal" AAS. Figure 1 is meant to highlight the units of functionality required for any AAS, but should not be read as a description of how to bundle these functionalities into actual components. Any AAS would need to allow assignments to be created and viewed and submissions to be sent to the system (possibly through a number of different submission pathways). Submissions received need to be graded by grader services; multiple grading steps will likely be involved and require some form of scheduling and weighting of the resulting marks according to a mark scheme. Marks and feedback will need to be presented to students and academics in meaningful ways. To execute grading, a number of execution services are needed, such as sandboxing of execution, monitoring, or possibly load balancing across grader servers. System-wide, common services are required to provide solutions for submission storage, auditing of the grading process, plagiarism detection, or setting of unique assignments, etc. Often, AAS are used in the context of existing virtual-learning environments (VLE), and some functionality needs to be provided to connect VLE and AAS.

[Fig. 1. Logical architecture of the "maximal" automated-assessment system: an assignment manager (assignment creation, feedback presentation, scheduling/mark schemes, VLE integration, submission management via ZIP, webform, Git, email, IDE, ...), a grader engine with execution services (sandboxing, monitoring, load balancing, ...) and graders, and common services (auditing, file storage/transfer, authentication/authorisation, unique assignments, plagiarism detection, ...).]

In the following, we discuss different options for "packaging" these logical units of functionality into actual system components. Not all AAS provide all functionalities; for example, many AAS do not provide explicit support for VLE integration.

Any automated assessment system will need to provide support for defining and managing assignments, enable students to make submissions (possibly through a number of different submission pathways), execute grading tools on the submissions, and present grades and feedback to students. In most grading systems, including ASSYST [13], Graja [8], PABS [12], PASS [5], ASAP [6], DUESIE [11], BOSS [14], or Marmoset [21], these concerns are separated into at least two components: one component provides functionality for teachers to define assignments and for students to upload submissions, while a second component performs the actual grading.

In modularising graders, we differentiate two levels of granularity:

1) Coarse-grained grader modularisation packages a complete grading pipeline (often for one course) into one module. For example, a grader might compile Java code, perform some style checks or static analyses, run a number of unit tests, and provide a combined grade and accumulated feedback to the student. The automated-assessment systems listed in the previous paragraph are examples of coarse-grained grader modularisation.
2) Fine-grained grader modularisation considers graders as building blocks for constructing a marking scheme for an assignment (see the interface sketch below). This implies graders can be more flexibly reused across assignments, and assignments can choose to use only the graders required. Fine-grained grader modularisation has, for example, been applied in ASB [9], JACK [23], and CourseMarker/Ceilidh [10].
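To make the fine-grained view concrete, each grader can be thought of as implementing a small, assignment-agnostic contract against which a mark scheme is then assembled. The following is a minimal, hypothetical Java sketch; the names are ours and are not taken from ASB, JACK, or CourseMarker/Ceilidh.

```java
import java.nio.file.Path;
import java.util.List;

// Hypothetical contract for a fine-grained grader building block.
// A mark scheme composes several such graders and weights their scores.
interface Grader {
    /** Grade one submission; the score is normalised to [0, 1]. */
    GraderResult grade(Path submissionDir);
}

// Result of one grading step: a normalised score, feedback for the
// student, and any files produced for downstream graders to reuse.
record GraderResult(double score, String feedback, List<Path> producedFiles) {}
```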
Clearly, fine-grained grader modularisation has many benefits. In particular, some graders (e.g., a peer-feedback grader as described for Praktomat [24]) are easily reusable across assignments and courses. Similarly, fine-grained grader modularisation makes it easy to select exactly the set of graders that should be applied for a particular assignment and to weight the grades provided specifically for the expected level of teaching. For example, for beginning programmers, we might put more weight on compilation, while for more advanced programmers this would become merely a check at the start of the assessment, with substantial weight placed on the assessment of functionality.
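Such per-assignment weighting can be pictured as a weighted sum over normalised grader scores. Below is a minimal Java sketch of this idea; the grader names and weights are invented for illustration and do not describe any particular system.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative mark scheme: an overall mark as a weighted sum of
// per-grader scores in [0, 1]. Grader names and weights are examples.
public class MarkScheme {
    private final Map<String, Double> weights = new LinkedHashMap<>();

    public MarkScheme weight(String grader, double w) {
        weights.put(grader, w);
        return this;
    }

    /** Combine per-grader scores into an overall mark out of 100. */
    public double combine(Map<String, Double> scores) {
        double total = 0, sumOfWeights = 0;
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            total += e.getValue() * scores.getOrDefault(e.getKey(), 0.0);
            sumOfWeights += e.getValue();
        }
        return sumOfWeights == 0 ? 0 : 100 * total / sumOfWeights;
    }

    public static void main(String[] args) {
        // Beginners: compilation carries real weight ...
        MarkScheme beginners = new MarkScheme().weight("compile", 0.4).weight("unit-tests", 0.6);
        // ... advanced students: compilation is a mere entry check.
        MarkScheme advanced = new MarkScheme().weight("compile", 0.05).weight("unit-tests", 0.95);
        Map<String, Double> scores = Map.of("compile", 1.0, "unit-tests", 0.5);
        System.out.println(beginners.combine(scores)); // 70.0
        System.out.println(advanced.combine(scores));  // 52.5
    }
}
```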
At the same time, there are drawbacks to fine-grained modularisation. First, graders must be developed independently and cannot easily make assumptions about other graders. This may mean some duplication of effort (e.g., for compilation) and requires explicit support for workflow management to ensure students are not given redundant feedback (e.g., when compilation fails, attempting to run unit tests would only confuse students). With coarse-grained grader modularisation, these interdependencies can be easily hard-coded. With fine-grained modularisation, responsibility rests with the assignment author. An important question is who "owns" which part of the data: most current techniques seem to focus on giving ownership to the central system, so that graders must fit all their configuration data into a centrally pre-defined schema (e.g., using ProForma [22]). ASB [9] supports hierarchical configuration, where configuration at the course level is automatically reused for all assignments, etc.; configurations still need to correspond to a centralised data schema. Typically (e.g., as described for JACK [23]), this centralisation means that the data is also managed by the central "assignment manager", so that graders need to request it every time they grade a submission. As we will describe in Sect. III, we prefer a modularised configuration, where each grader "owns" its own configuration data (both the schema and the actual data), as this decouples different parts of the system better and requires less network traffic when marking submissions.

Other concerns of automated assessment have also been considered for modularisation. However, more detailed analysis of these modularisation choices is clearly still needed:

• Some grading systems have focused on modularising their front-end services. For example, BOSS [14] distinguishes student and teacher servers to support the different roles engaging with the system (in principle, this could be extended to a larger subset of the ideal set of stakeholders [20]). Similarly, FW4EX [19] defines separate servers for creating/editing assignments, accessing assignments, and uploading submissions, so that these can be resourced in accordance with their different performance requirements. FW4EX also has some modularisation of different submission pathways through the introduction of IDE plugins for use by students.
• Some works have focused on modularising general services. For example, Grappa [7] provides generic middleware for connecting graders to a standard VLE. PABS [12] uses a standard SVN server as the shared file-storage system for submissions and assignment data (although, surprisingly, both are kept in the same repository). ASB [9] explicitly modularises different execution environments. A number of systems provide plagiarism detection services as a separate module.
B. How to modularise

The decision of what to modularise affects primarily the maintainability, extensibility, and flexibility of the system. In order to modularise for scalability, robustness, and security, we need to consider how to implement the modular architecture.

There is a clear trend from monolithic systems with some internal modularisation (typically using object-oriented principles, as in PASS [5]) to more distributed and loosely coupled multi-process systems (e.g., JACK [23] or [17]). For the latter systems, different communication protocols have been experimented with: [17] uses internal email communication, ASB [9] implements a dedicated event bus, and JACK [23] uses a bespoke protocol with graders pulling new submissions from the assignment manager. Recently (for example, in eduComponents [1]) service-based architectures are being discussed. Especially when building on standardised protocols (such as REST-ish web APIs, containerisation through Docker or similar, ...), these substantially simplify the scaling of automated assessment systems through the use of state-of-the-art virtualisation, load-balancing, and swarm-management techniques. Containerisation also helps with security, as containers can be confined in their use of resources. This has been initially explored, for example, for Praktomat [4].

Distribution and the use of web-based APIs bring their own potential security challenges: if API endpoints can be accessed by students, extra measures are needed to prevent students from submitting fake marks for their own assignments or from manipulating grader configurations. Service-based distribution potentially also creates concerns about trust between services, especially where they are managed by different organisations.

With fine-grained grader modularisation, an interesting question is how to specify mark schemes (i.e., which graders to use and how to weight the marks provided). For coarse-grained grader modularisation, the mark scheme can be hard-coded into the system or possibly configured by parametrisation of the fixed steps [8]. For fine-grained grader modularisation, different systems provide different answers. Some systems [10] have used Java code to implement mark schemes. This maximises flexibility, but also requires more attention to low-level detail from the assignment developer. Other systems [22], [9], [19] allow configuration through (XML) files at the level of assignments. This is less flexible, as essentially the assignment developer can only choose from a range of pre-defined configuration options, but provides a more standardised interface and a higher level of abstraction.
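To give a flavour of the service-based style discussed above, a single grader can be exposed as a small web service with a marking endpoint. The following is a minimal sketch using the JDK's built-in HTTP server; the /mark path and the JSON response shape are our own invention, not a protocol prescribed by any of the systems cited.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Minimal sketch of a grader exposed as a web service. Each grader runs
// as its own process/container and is registered with the core by URL.
public class GraderService {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/mark", exchange -> {
            // A real grader would fetch the submission, run its checks,
            // and compute a score; here we return a fixed placeholder.
            byte[] body = "{\"score\": 0.8, \"feedback\": \"<p>2 of 10 tests failed</p>\"}"
                    .getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
    }
}
```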
III. NEXUS: A MICRO-SERVICE APPROACH TO FINE-GRANULAR MODULARITY

At King's, we are developing an automated assessment platform called NEXUS. This platform was designed specifically to be flexible and extensible, including potentially to non-programming modules. To support extensibility, we aimed to maximise modularity: NEXUS uses fine-granular modularisation of graders and modularises a number of common services. In particular, all submissions are stored in GitHub Enterprise (decoupling graders from submission pathways and providing access to a student's submission history), and we provide generic support for the generation of unique assignments. Flexibility and security are increased by decoupling all components into their own micro-services [15], with services interacting through REST-ish web-API endpoints. Figure 2 gives an overview of the architecture. Each service (with the exception of GitHub Enterprise, which is managed separately within King's) is maintained and managed by us in a central monorepo [16] and is deployed in its own Docker container. As a result, we can easily scale the system by deploying redundant instances of overloaded services and using Docker Swarm to ensure adequate load balancing. Using a monorepo means development can proceed as if the system was a tightly coupled monolith, ensuring continuous compatibility between all micro-services in the overall architecture.

[Fig. 2. Architecture of NEXUS. Each box is a separate micro-service; all interfaces (ITeacher, IStudent, IMark, IFeedback, IConfig, IGrade, IGenerate) are REST-ish web-API endpoints. A core management component connects grading services, GitHub Enterprise (via the GitHub API/Git, with OAuth), and a unique-assignments service.]

All parts of the system are loosely coupled. For example, the available graders are configured by providing the URLs of their respective API endpoints for configuration and submission-marking. This makes it easy to add additional graders even at runtime, by adding their API information through the web-based administration interface. We use a fine-granular modularisation of graders (see Table I), including some graders such as peer-review or manual that can be easily reused across different modules. Graders own their configuration data, allowing for arbitrarily complex data schemata. For example, the peer-feedback grader allows the configuration of a web form for students to fill in when providing a review.

TABLE I. CURRENTLY AVAILABLE GRADERS

  Grader       Purpose
  javac        Check compilation and code style of Java code
  jUnit        Run unit tests against Java code
  io-test      Run input-output tests against Java code
  dyn-trace    Capture Java execution traces & compare against a model solution
  matlab       Grade MatLab-based mathematics submissions
  python       Unit test python-based submissions
  peer-review  Enable student peer review of submissions
  manual       Support manual grading of assignments

Because graders are independent micro-services, faults are easily contained in the problematic grader. A fault in a grader may mean students having to wait longer to receive a grade or feedback, but will not cause issues for the remaining graders or assignments. This also minimises the opportunities for rogue student code to attack the grading infrastructure. At the same time, it also simplifies incremental improvement of graders. Student submissions can sometimes be highly creative and difficult to predict, occasionally causing graders to fail processing a submission. Because all submissions are managed centrally and the actual submission files are kept in GitHub Enterprise, a faulty grader can easily be repaired and a new version spun up (reusing configuration data from the service's database). NEXUS provides the ability to request a regrading of a particular submission or a set of submissions, whereupon these submissions are simply sent to the relevant graders again for repeat processing (NEXUS maintains an audit log tracking any such requests). This increases overall reliability and resilience.

We provide explicit support to express grader dependencies and use simple distributed dataflow controls [2]: graders can specify the types of files they produce and require. Given such specifications, NEXUS will automatically determine a maximally concurrent grader execution. When invoking a grader, NEXUS provides information about which kinds of files should be sent to other graders. Graders then send on any files produced as indicated. Subsequent graders can use these additional files without having to recreate them.
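As an illustration of how such produces/requires declarations can drive a maximally concurrent schedule, the sketch below groups graders into waves: a grader runs as soon as all file types it requires are available, and graders in the same wave can run concurrently. The grader and file-type names are invented; NEXUS' actual scheduling logic is not published in this form.

```java
import java.util.*;

// Sketch: derive concurrent execution "waves" from declared file types.
// A grader is ready once every file type it requires has been produced.
class GraderScheduler {
    record GraderSpec(String name, Set<String> requires, Set<String> produces) {}

    static List<List<String>> schedule(List<GraderSpec> graders) {
        List<List<String>> waves = new ArrayList<>();
        Set<String> available = new HashSet<>(Set.of("submission")); // initial input
        List<GraderSpec> pending = new ArrayList<>(graders);
        while (!pending.isEmpty()) {
            List<GraderSpec> ready = pending.stream()
                    .filter(g -> available.containsAll(g.requires())).toList();
            if (ready.isEmpty()) throw new IllegalStateException("unsatisfiable dependencies");
            waves.add(ready.stream().map(GraderSpec::name).toList());
            ready.forEach(g -> available.addAll(g.produces()));
            pending.removeAll(ready);
        }
        return waves;
    }

    public static void main(String[] args) {
        System.out.println(schedule(List.of(
                new GraderSpec("javac", Set.of("submission"), Set.of("classfiles")),
                new GraderSpec("junit", Set.of("classfiles"), Set.of()),
                new GraderSpec("io-test", Set.of("classfiles"), Set.of()))));
        // -> [[javac], [junit, io-test]]: junit and io-test run concurrently.
    }
}
```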
Modularisation and the use of web interfaces create a new attack surface. Students could, in principle, attempt to spoof marks by sending requests directly to NEXUS' IMark interface. Similarly, they could attempt to modify the configuration of graders by directly accessing their IConfig interfaces. To prevent such attacks, NEXUS uses randomly generated tokens which must be passed along with any HTTP/HTTPS requests. These allow the receiver to check that the invocation did indeed originate from the claimed source. Graders provide feedback as HTML code, which is directly embedded into the NEXUS feedback page. If graders were allowed to be arbitrary web services, this could easily be a security risk through the potential for XHR attacks. However, in our micro-service architecture, all services are directly under our control, so that we can trust their implementations.
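A minimal sketch of such a shared-token check follows. The token format and handling here are our own illustration, not the published NEXUS implementation; a constant-time comparison is used so the check does not leak information through timing.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.SecureRandom;
import java.util.Base64;

// Sketch of a token check guarding inter-service requests: each request
// must carry a token matching the randomly generated one shared at deployment.
class RequestAuth {
    /** Generate a random token to be shared between trusted services. */
    static String newToken() {
        byte[] raw = new byte[32];
        new SecureRandom().nextBytes(raw);
        return Base64.getUrlEncoder().withoutPadding().encodeToString(raw);
    }

    /** Compare the presented token against the expected one in constant time. */
    static boolean isAuthorised(String presented, String expected) {
        if (presented == null) return false;
        return MessageDigest.isEqual(
                presented.getBytes(StandardCharsets.UTF_8),
                expected.getBytes(StandardCharsets.UTF_8));
    }
}
```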
IV. CONCLUSIONS

Modularity concerns are important for developing robust, scalable, flexible, and extensible automated assessment systems. Yet, modularity has not been systematically discussed in this field to date. We have presented an exploration of the modularity design space for automated assessment, including how different existing systems are positioned in this space and some of the benefits and drawbacks of different choices. Additionally, we have briefly presented our novel NEXUS platform, which takes a previously unoccupied position in this space by providing fine-granular grader modularisation realised through a micro-service architecture. We believe that this architecture provides substantial benefits to the robustness and flexible extensibility of our platform as well as to its scalability and reliability. In the future, we plan to extend usage of our NEXUS platform, including to modules outside of computer science, and, in particular, will focus on improving the feedback provided by individual graders.

REFERENCES

[1] M. Amelung, P. Forbrig, and D. Rösner, "Towards generic and flexible web services for e-assessment," in Proc. 13th Annual Conf. Innovation and Technology in Computer Science Education (ITiCSE'08). ACM, 2008, pp. 219–224. [Online]. Available: http://doi.acm.org/10.1145/1384271.1384330
[2] W. Binder, I. Constantinescu, and B. Faltings, "Decentralized orchestration of composite web services," in IEEE Int'l Conf. Web Services (ICWS'06), 2006, pp. 869–876.
[3] O. J. Bott, P. Fricke, U. Priss, and M. Striewe, Eds., Automatisierte Bewertung in der Programmierausbildung, ser. Digitale Medien in der Hochschullehre. Waxmann Verlag GmbH, 2017, vol. 6.
[4] J. Breitner, M. Hecker, and G. Snelting, "Der Grader Praktomat," in Automatisierte Bewertung in der Programmierausbildung, ser. Digitale Medien in der Hochschullehre, O. J. Bott, P. Fricke, U. Priss, and M. Striewe, Eds. Waxmann Verlag GmbH, 2017, vol. 6.
[5] M. Choy et al., "Design and implementation of an automated system for assessment of computer programming assignments," in Proc. 6th Int'l Conf. Advances in Web Based Learning (ICWL'07), H. Leung et al., Eds. Springer, 2008, pp. 584–596. [Online]. Available: https://doi.org/10.1007/978-3-540-78139-4_51
[6] C. Douce et al., "A technical perspective on ASAP: automated system for assessment of programming," in Proc. 9th Computer-Assisted Assessment (CAA) Conference, 2005.
[7] P. Fricke et al., "Grading mit Grappa: Ein Werkstattbericht," in Proc. 2nd Workshop "Automatische Bewertung von Programmieraufgaben" (ABP'15), ser. CEUR-WS, U. Priss and M. Striewe, Eds., vol. 1496, 2015, pp. 9-1–9-8. [Online]. Available: http://ceur-ws.org/Vol-1496/paper9.pdf
[8] R. Garmann, "Graja: Autobewerter für Java-Programme," Fakultät IV – Wirtschaft und Informatik, Hochschule Hannover, Tech. Rep., 2016. [Online]. Available: http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bsz:960-opus4-9418
[9] B. Herres, R. Oechsle, and D. Schuster, "Der Grader ASB," in Automatisierte Bewertung in der Programmierausbildung, ser. Digitale Medien in der Hochschullehre, O. J. Bott, P. Fricke, U. Priss, and M. Striewe, Eds. Waxmann Verlag GmbH, 2017, vol. 6, pp. 255–271.
[10] C. A. Higgins et al., "Automated assessment and experiences of teaching programming," J. Educ. Resour. Comput., vol. 5, no. 3, Sep. 2005. [Online]. Available: http://doi.acm.org/10.1145/1163405.1163410
[11] A. Hoffmann et al., "Online-Übungssystem für die Programmierausbildung zur Einführung in die Informatik," in 6. e-Learning Fachtagung Informatik (DeLFI'08), ser. LNI, S. Seehusen et al., Eds., vol. 132. GI, 2008, pp. 173–184.
[12] L. Iffländer et al., "PABS: a programming assignment feedback system," in Proc. 2nd Workshop "Automatische Bewertung von Programmieraufgaben" (ABP'15), ser. CEUR-WS, U. Priss and M. Striewe, Eds., vol. 1496, 2015, pp. 5-1–5-8. [Online]. Available: http://ceur-ws.org/Vol-1496/paper5.pdf
[13] D. Jackson and M. Usher, "Grading student programs using ASSYST," in Proc. 28th Technical Symposium on Computer Science Education. ACM, 1997, pp. 335–339. [Online]. Available: http://doi.acm.org/10.1145/268084.268210
[14] M. Joy, N. Griffiths, and R. Boyatt, "The BOSS online submission and assessment system," J. Educ. Resour. Comput., vol. 5, no. 3, Sep. 2005. [Online]. Available: http://doi.acm.org/10.1145/1163405.1163407
[15] S. Newman, Building Microservices: Designing Fine-Grained Systems. O'Reilly, 2015.
[16] M. Oberlehner, "Monorepos in the wild," 2017, last accessed 30 January 2018. [Online]. Available: https://medium.com/@maoberlehner/monorepos-in-the-wild-33c6eb246cb9
[17] A. Pardo, "A multi-agent platform for automatic assignment management," in Proc. 7th Annual Conf. Innovation and Technology in Computer Science Education (ITiCSE'02). ACM, 2002, pp. 60–64. [Online]. Available: http://doi.acm.org/10.1145/544414.544434
[18] U. Priss and M. Striewe, Eds., Proc. 2nd Workshop "Automatische Bewertung von Programmieraufgaben" (ABP'15), ser. CEUR-WS, vol. 1496, 2015.
[19] C. Queinnec, "An infrastructure for mechanised grading," in Proc. 2nd Int'l Conf. Computer Supported Education, 2010, pp. 37–45.
[20] N. Sclater and K. Howie, "User requirements of the 'ultimate' online assessment engine," Computers & Education, vol. 40, no. 3, pp. 285–306, 2003. [Online]. Available: https://doi.org/10.1016/S0360-1315(02)00132-X
[21] J. Spacco et al., "Experiences with Marmoset," Univ. Maryland, Tech. Rep., 2006.
[22] S. Strickroth et al., "ProFormA: An XML-based exchange format for programming tasks," e-learning & education (eleed), vol. 11, no. 1, 2015. [Online]. Available: http://nbn-resolving.de/urn:nbn:de:0009-5-41389
[23] M. Striewe, "An architecture for modular grading and feedback generation for complex exercises," Science of Computer Programming, vol. 129, pp. 35–47, 2016, special issue on eLearning Software Architectures. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0167642316300260
[24] A. Zeller, "Making students read and review code," in Proc. 5th Annual Conf. Innovation and Technology in Computer Science Education (ITiCSE'00). ACM, 2000, pp. 89–92. [Online]. Available: http://doi.acm.org/10.1145/343048.343090