1. Introduction

Extracting Unit Tests from Patterns Mined in Student Code to Provide Improved Feedback in Autograders

Julien Lienard

Kim Mens

Siegfried Nijssen

0 0 ICTEAM , UCLouvain , Belgium

CS1 courses with large student numbers commonly use autograders to provide students automated feedback on basic programming exercises. Programming such feedback to integrate it into an autograder is a non-trivial and time-consuming task for teachers. Furthermore, such feedback is often based only on expected outputs for a given input, or on the teacher's perception of errors that students may make, rather than on the errors they actually make. We present an early implementation of a tool prototype and supporting methodology to address these problems. After mining the source code of earlier students' responses to exercises for frequent structural patterns, and classifying the found patterns according to these students' scores, our tool automatically generates unit tests that correspond to bad practices, errors or code smells observed in students' submissions. These unit tests can then be used or adapted by a teacher to integrate them into an autograder, in order to provide feedback of higher quality to future generations of students.

eol>CS education pattern mining unit testing code generation autograders automated feedback

1. Introduction Automated graders [1] are often used in the context of

introductory programming courses to assist students by providing automated feedback on their programming exercises. This feedback should be of high quality and as detailed as possible, so that it can help the students to learn from and correct their errors autonomously.

Motivation Creating such automated feedback to be integrated in an

autograder is a non-trivial and time-consuming task for teachers. As a consequence, often they limit the feedback to what can be checked by comparing the output of students’ programs on given inputs to the expected outputs. In this work, we argue that more qualitative feedback could be derived from structural regularities discovered in the source code of students’ submissions.

Part of the challenge is to decide what regularities are most relevant for this purpose, and secondly how to transform these regularities into tests that could be run on students’ submitted source code. Our proposal is to generate such tests after using a combination of a code mining algorithm and a manual analysis of the outcome of this code mining process by teachers. This data mining algorithm looks for frequent code patterns that are more representative for bad solutions than good solutions.

Reviewing earlier submissions can reveal a lot about how students approach problems and come up with answers. Teachers may be able to provide better feedback to students if they are aware of these patterns in typical good and bad solutions. Students who struggle with a particular exercise may benefit from seeing positive instances of ‘good’ solutions as well as from identifying ‘poor’ patterns in their code in order to understand why it is incorrect.

Research Questions

Our long-term goal is to aid teachers of introductory programming courses to include relevant feedback in autograders, in order for students to better understand and correct their mistakes. The main research questions of this paper are: 1. Can we discover recurrent error patterns in students’ code and use them to create more advanced feedback? 2. Can we automatically generate test suites from such identified error patterns, so that instances of these errors can be detected and reported automatically to students?

To answer the first research question, we use a data

mining algorithm called FREQTALS for finding frequent tree patterns in students’ source code [ 2 ]. This technique was tested on a dataset of student code submissions and was shown previously to be able to find interesting trends representative of good and bad coding idioms. [ 3 ]

To tackle the second research question, we develop an automated tool to generate unit tests for the discovered patterns in previous student submissions. This tool can help teachers incorporate these unit tests into an methods such as parsing and type checking. Depending autograder, to provide more comprehensive feedback to on the programming language used and the kind of feedstudents. We believe that our tool prototype has the po- back given, various methods can be helpful in diferent tential to enhance instruction efectiveness and improve situations. The need to manage multiple programming students’ learning outcomes. languages, the complexity of natural language creation,

The focus of the current paper is mainly on the process and the issue of assessing the quality of the generated of extracting unit tests from selected patterns. The min- feedback were some of the challenges and restrictions ing algorithm itself was already reported upon before, the authors mentioned. and extending the unit tests with detailed feedback that Our proposed technique tries to improve the kind of helps students overcome their errors is mostly a manual feedback given to students on programming exercises by efort still. helping teachers identify recurrent errors and to automatically generate unit tests that can detect such errors Validation in students’ code. We do not yet automatically produce a description of the feedback in natural language for those As initial ongoing validation of our tool prototype we unit tests. compiled a sizable dataset of Python code taken from an Solution errors occur in programs when they do not perinstance of the INGInious [ 1 ] platform, an autograder form as expected. These errors can manifest themselves that enables students to submit and receive feedback on in the form of logic errors, when a program does not programming exercises. The dataset corresponded to a correctly accomplish a desired task. Our technique aims ifnal exam for an entry-level programming course with to transform detected knowledge about mistakes into au569 students. We applied the FREQTALS code mining al- tomated feedback capturing solution errors to be included gorithm to this dataset to identify common patterns, and into an autograder. then generated unit tests to check which of these patterns Narciss [ 5 ] argues that feedback is very important are present in the source code of the student submissions. when it comes to online learning. Having a feedback A teacher selected and augmented a few of these patterns loop helps students to really understand and complete to give more advanced feedback to students, based on the their exercises successfully. She also showed that oferdiscovered structure in their submitted code. The gen- ing answer-until-correct (AUC) exercises with feedback erated unit tests for these patterns together with their is generally more efective than feedback on ‘one-shot’ additional feedback were then added to the autograder, exercises. AUC feedback alone may not be suficient for to be used by students to prepare for future exam ses- some learning activities however, according to her study, sions. A more thorough analysis of how students use and which suggests that it should be coupled with rich elaboappreciate this feedback will be the focus of a follow-up rated feedback. AUC feedback is advantageous because it paper. provides students with numerous chances to apply what they have learned and fix their mistakes, improving their 2. Related Work memory of the material. In this work, we try to make feedback more rich by making it more personalised, by comparing the student’s submission to recurring errors that have been detected in submissions by earlier generations of students.

The development of automated methods to generate feedback on programming exercises is an important field when it comes to helping students who learn how to program. Importance of feedback

A systematic literature review on automated feedback for programming exercises was conducted by Keuning et al. [ 4 ]. Their survey analysed and categorised 69 papers according to 4 factors: 1. Kind of feedback; 2. Technique used for the feedback; 3. Adaptability by the teachers; 4. Quality of the feedback.

According to this survey, the majority of current methods to generate automatic feedback for programming exercises look for coding mistakes by using static analysis

Feedback generation

The Concepts and Skills based Feedback Generation Framework (CSF2) was proposed by Haldeman et al. [ 6 ] for designing programming assignments, analyzing student submissions, and generating hints for assisting students in correcting their errors. In this method, the set of concepts and skills, called knowledge map, that students are required to master to complete the assignment is taken into consideration while creating the task, creating the test suite, gathering and organizing submissions, and reifning the test suite and knowledge map. Errors are then mapped to concepts and skills into what they call a bucket.

New tests are then created by hand to assess the common errors found in each bucket. The described framework presents a useful structure for improving classroom in- Other sources of inspiration struction and identifying students’ errors and misconceptions. The framework is described as a sequence of steps. As stated in the introduction, our paper can be regarded Our proposed approach is complementary in the sense as an extension the experiment described by Mens et that it could automate some of these steps, such as the al. [ 3 ]. They used pattern mining on students’ code subcreation of test suites. missions to detect recurring errors made by students

One of the state of the art tools to create personal- participating in a first-year programming course. Our ized feedback is AutoGrader [ 7 ]. Teachers first provide a main contribution is the addition of test generation for reference solution and a list of possible mistakes before matching the discovered patterns, in order to apply it to students can use AutoGrader. This information is then new students’ code submissions. The goal of this extenused by AutoGrader to generate potential code modifica- sion is to, after having discovered frequently occurring tions that students could apply to reach the right answer. mistakes using the technique described by Mens et al., The authors start from the hypothesis that student mis- generate tests that can make students aware of such mistakes can be predicted by teachers. In our approach, takes and as such help them correct and avoid them in the we do not make that assumption and instead attempt to future. We created a first tool prototype, used it to create identify frequent mistakes made by students by studying a test task for students using this generation technique, assignments submitted by prior student generations. and are currently analysing initial results to see if the

Another interesting automated tool, closer to what we feedback received helps students correct their mistakes. propose, is CARLA [ 8 ]. It uses existing correct student solutions in order to try and repair the code of students 3. Methodology who made mistakes. In order to do so, it first groups accurate student answers using the concept of dynamic Figure 1 summarises the diferent steps of our approach equivalence. The clustering, which serves as basis for and tools to help teachers integrate more advanced feedthe repair process, groups together similar correct solu- back for students in an autograder, based on patterns tions. The repair algorithm then uses expressions from observed in previous students’ submissions. The first several correct answers that are located in a same cluster step consist of gathering the source code of these prior to build minimum corrections for an incorrect student at- submissions (cf. 3.1). The second step extracts abstract tempt. The strategy reduces the adjustments required to syntax trees from this source code to be given as input to make a student reach a proper solution by using the most the tree mining algorithm, which then finds frequently comparable accurate solutions, written by other students. occurring patterns in those ASTs (cf. 3.2). From the list of Our technique instead uses all students’ code (not only discovered patterns, in step 3 teachers select the most incorrect solutions) and, rather than using dynamic equiv- teresting ones that they would like to convert into python alence, uses frequent tree pattern mining to group them. unit tests (cf. 3.3). They can then adapt and integrate From those groups we create tests that correspond to the these generated unit tests into the autograder to provide patterns found and let teachers pick the most interesting additional feedback to the students based on the strucones and add meaningful feedback to those tests. Our ture of their code to help them understand and correct tool thus doesn’t focus on code repair but rather on the their mistakes (cf. 3.4). automatic creation of unit tests.

Our work is comparable to a tool called Codewebs 3.1. Data gathering proposed by Nguyen et al. [ 9 ]. Their strategy uses the solutions of prior submissions to create customized feed- To gather a suficiently large data sample, we had access back for large MOOCs. They created a queryable index to the source code submitted by students on small profor quick searches into a sizable dataset of student as- gramming exercises and exam questions in the context of signment entries by breaking down online homework an entry-level programming course at university. They submissions into a vocabulary of “code phrases”. They submitted their code on and received feedback from the argue that by force-multiplying teacher efort, tools like INGInious platform 1, an open source and online teaching Codewebs can help improve the quality of free education. assistant used by several universities around the world. They use shared structure among submissions to create For the experiment described in this paper, we collected specific feedback on student errors. Both our work and the code submitted by students in response to an exam Codewebs use the data from student submissions from question, which asked students to write a program that a MOOC, but our approach uses a diferent method to computes the prime factors of a given number. identify shared structure in student submissions. The following signature and specification of the function they had to write was given to the students:

XML

Feedback 1–8 1. Data gathering 2.a Extract AST 2.b Mine Pattern in AST 3. Test Generation 4. Feedback Creation d e f f a c t o r s ( n ) : represents a part of an AST. FREQTALS is based on the " " " FREQT algorithm [ 10 ] and identifies frequent patterns P r e : n a s t r i c t l y p o s i t i v e i n t e g e r in a dataset by iteratively generating candidate sets of P o s t : r e t u r n s i n t e g e r r e p r e s e n t i n g patterns and pruning those that do not meet a minimum number o f p r i m e f a c t o r s o f n support threshold. FREQTALS requires the ASTs it takes " " " as input to be in a particular XML tree format. Our tool # c o d e t o c o m p l e t e makes sure this format is followed, by using a Python parsing library 2 to convert the Python programs written together with an example of what results the function by students into the format required by the miner. Code should return for some given cases: that could not be parsed was not considered for the test f a c t o r s ( 3 3 ) # r e t u r n s 2 ( 3 3 = 3 x 1 1 ) generation. f a c t o r s ( 1 2 ) # r e t u r n s 3 ( 1 2 = 2 x 2 x 3 ) FREQTALS can be configured with various constraints, f a c t o r s ( 8 ) # r e t u r n s 3 ( 8 = 2 x 2 x 2 ) such as the minimum number of files in which the pattern f a c t o r s ( 1 2 7 ) # r e t u r n s 1 ( 1 2 7 p r i m e ) can be found, the minimum or maximum size of the patterns, and the list of allowed root nodes (i.e. what

The exam context guarantees that students responded types of syntactic constructs, such as Python function to the questions individually and under similar condi- bodies, the miner should focus on). All these constraints tions, on university computers in a classroom setting. were configured as described in the original paper. Furthermore, whereas students received automated feed- As the ASTs of the students’ code are tree-like repback throughout the semester and when preparing for the resentations, FREQTALS searches for patterns in tree exam, during the exam itself the autograder’s feedback structures. It takes into account the structural characteroption was turned of. This implies that the patterns istics of the patterns, such as the placement of the child which we discovered after analysing the student’s re- nodes and the frequency with which particular labels sponses to the exam questions were not biased by any appear in the tree. The pattern miner discovers diferent feedback they received on those questions. types of patterns, including control flow patterns, data

As in the mining experiment described by Mens et structure patterns, and algorithm pattern types. Comal. [ 3 ], we separated the dataset in 2 groups: students mon programming constructs like loops, conditionals who obtained a score of 50% or more on the question and and function calls can be included in these patterns, as those who obtained a score of less than 50%. This score well as more specialized language constructs like list was calculated automatically by running a set of unit tests comprehension or generator expressions. that checked diferent input-output pairs that a correct We used the FREQTALS tree miner to find common solution should respect. Taking these two subsets as patterns in the ASTs of the programs submitted by both input, the mining algorithm would try and find patterns the group of students who obtained 50% or more and by that are more representative for one set than for the other. the group who scored less than 50%, in order to assess the eficacy of our tool. Whereas we kept the same configura3.2. Pattern mining tion settings as in the original paper, we did experiment with diferent values to set the threshold for a pattern to be considered as ‘frequent’. We experimentally deter

FREQTALS [2] is a tree mining algorithm designed to

identify frequent patterns in the abstract syntax trees (ASTs) of programs. Each pattern is a tree structure that

2https://docs.python.org/3/library/ast.html#ast.parse

Listing 1: Example of an augmented unit test written using our technique.

The pattern miner’s output is an XML file containing the tree structure of all discovered patterns. We transform this output into a set of Python unit tests. We want each resulting unit test to check whether a code submission matches a pattern found by the mining algorithm. To do so, for a given pattern, our unit test generation tool traverses its structure. For each node of the pattern’s 4. Initial Experiment tree, the tool generates a Python code block that searches for corresponding AST nodes in the code to be checked. In this section we will describe the results of an initial If the pattern node has children, the tool searches the experiment we conducted to assess the pattern mining children of the found AST node for matches in its children. tool and a first prototype implementation of our unit test This search is repeated until all nodes in the pattern’s generation tool.

XML structure have been checked, but if no match is discovered, the tool backtracks to the next matching AST 4.1. Patterns node. A fragment of such a generated unit test can be found in Listing 4 in Subsection 4.2.

For the exam question mentioned in subsection 3.1, we

gathered the code submitted by a total of 569 students.

Of these, 560 were parsed successfully and included in 3.4. Feedback creation our analysis. Most of the code that didn’t parse was Our longer-term research goal is to assist teachers in unfinished code or code containing syntax errors. The providing students with more insightful feedback on their code was then split into two groups: 149 submissions for exercises. The generated tests are to be used to evaluate which the students obtained a score of less than 50%, and students’ code automatically and to be augmented with 411 submissions with a score of 50% or higher. (These more specific feedback on the errors they make. scores were calculated automatically with an autograder,

After having run the pattern miner on previous stu- after completion of the exam, using a set of graded unit dents’ submissions to existing assignments, teachers can tests that were programmed by the teacher beforehand.) select any pattern that catches their attention, and gener- Next, playing the role of a teacher, we analysed all 147 ate a Python unit test for it. They can then add additional patterns mined by the FREQTALS algorithm to identify feedback to that unit test and include it into the auto- which of them could be interesting or potentially useful grader for that assignment, or they can use it to design a to transform into unit tests. This analysis included a new assignment around that pattern. review of the number of times each pattern appeared

The generated test will fail and reveal problems in stu- in the code of the high-scoring and low-scoring groups, dents’ code if it does not adhere to the pattern. Students as well as a manual review of the patterns themselves can then rely on this augmented feedback provided by to identify if they captured a common mistake made by the teacher in order to help them fix their mistakes (see several students. Other criteria we used in our selection Listing 1). Compared to just receiving a mark or general were whether the pattern’s absence or presence afected the code’s quality, the total number of occurrences of the <Assign> <__directives>

<match-sequence/> </__directives> <targets> <Name> <identifier> <Dummy>

...

</Dummy> </identifier> </Name> </targets> <value> <List> <elts> <__directives>

<match-sequence/> </__directives> <Constant> <identifier> <Dummy>

...

</Dummy> </identifier> </Constant> ...

</elts> </List> </value> </Assign>

targets Assign value

Name List elts

Constant Constant Constant Constant Constant Constant

Listing 2: Example of a student’s bad practice: hard cod

ing the list of prime numbers to check. d e f f a c t o r s ( n ) : l = [ 2,3,5,7,11,13,17,19 ] # l i s t o f p r i m e s i = 0 t = [ ] f o r i i n range ( l e n ( l ) ) : i f n% l [ i ] == 0 : t . a p p e n d ( l [ i ] ) i −= 1 n b r = l e n ( t ) r e t u r n ( n b r ) Listing 3: Example of a student’s code close to the correct

solution, but confounding ‘if’ and ‘while’. d e f f a c t o r s ( n ) : c o u n t =0 i f n = = 1 :

r e t u r n 0 i f n < 0 :

r e t u r n F a l s e s =n for i in range (2,n): if n%i==0: s=s/i p r i n t ( s ) e l s e :

c o u n t +=1 r e t u r n c o u n t pattern and the diference between matching code in the high-scoring and low-scoring groups.

We discovered several patterns that could reveal potential bad practices or errors in students’ code. An example of such a bad practice is illustrated in Listing 2. This code corresponds to a match of the second pattern that was found. The corresponding XML pattern and AST representation are shown in Figure 2.

The part of the code that matches the pattern is high- found by the miner. One such error is the overuse of lighted in red in Listing 2. The pattern matches code nested loops. Even if we found some cases where stufragments where students hard coded the list of prime dents achieved to create correct code using more than factors to check. We found 6 students’ code that matched 2 nested loops, we did observe that such code becomes this pattern. A high-level feedback that the teacher may quite hard to read and is to be avoided. want to add for this pattern is that computing those num- It should also be said that not all patterns that were bers is probably a more generic solution than hard-coding mined were of interest. Some of them do capture rethem in a list. curring Python code fragments but do not necessarily

Listing 3 is an occurrence of another pattern showing provide any information on the student’s code quality. a student’s code that is close to a good solution. Unfortu- For example, pattern number 60 described a code fragnately the student should have used a while-statement ment containing an if with a return inside followed instead of an if-statement at line 9. This pattern was by a return at the end of the function. That by itself found in the code of 8 students. A teacher may want to doesn’t say anything about whether the code is good provide as feedback to these students that they may have or bad. It is the teacher’s job to filter, from all patterns confused an ‘if’ for a ‘while’. discovered by the miner, the most relevant ones worth

In addition to analysing the discovered patterns them- checking. selves, the mere fact of having to walk through code fragments that match certain patterns, sometimes leads to the discovery of other recurrent errors that were not

Listing 4: Example of test block generated for the pattern 2 that will search a function definition inside all blocks in the AST 4.2. Test generation blocks is shown in Listing 4. In the first block (lines 2-5), we search for all instances of a ‘body’ node in the The FREQTALS algorithm identified a total of 147 fre- code. If no ‘body’ nodes are found, the pattern cannot quent patterns in the code submitted by both groups of be matched (lines 3-4). Then, we iterate through each students. The patterns can occur in one or both the high- identified ‘body’ node (line 5). The second block is visible scoring and low-scoring groups. When a same pattern from lines 6 to 9. occurs in the high-scoring group or in both groups at The first block employs the ast.walk()3 functhe same time, it often means that the pattern is either tion to search for the root node of the pattern anya part of or close to a good solution (such as the pattern where in the code, while the second block uses depicted in Listing 3) or an uninteresting pattern. When ast.iter_child_nodes()4 to search only within the a pattern can only be found in the low-scoring group, it direct child nodes of the currently matched node. A is often a pattern representing an error or bad practice return statement only appears in the first block, while (such as the pattern depicted in Listing 2) but sometimes the continue statement on line 8 allows us to fall back to also a pattern of no interest. For each of the discovered the next matching node. This same structure is followed patterns, our tool could produce a unit test. by all of the other test blocks as well.

It should be noted that out of the 147 frequent code pat- We believe that, to some extent, this block structure terns found, about two thirds were patterns too generic allows us to reach an acceptable trade-of between readto be useful (for example a pattern that just matches the ability and correctness of the test. We do think that, presence of two assignment statements in the code). Of once a teacher understands the block structure of the the remaining patterns, about half were representative generated unit tests, the tests remain suficiently underof code fragments occurring in good solutions and the standable. Adapting the tests slightly to capture slight other half were patterns occurring mostly in the bad solu- variations of the pattern is possible, although not trivial. tions. Some patterns often tend to be quite close to other patterns as well, so in the end the teacher kept only a handful of relevant patterns to be transformed into unit 4.3. Feedback creation tests. For our dataset consisting of student submissions for an

Apart from this problem of having to wade manually exam question, we created 4 unit tests generated from or through the many patterns found to retain only a few inspired by the mined patterns. We ofered this question, relevant ones, we also faced another issue which was with the additional unit tests adorned with the dedicated to ensure that the Python code generated for unit tests feedback we added, to new students as a revision exercise remained understandable to teachers. This is desirable to help them prepare for an upcoming exam session. Of as it allows teachers to understand the pattern just by the 4 unit tests that we added, 3 of them were generated looking at the Python code. It also allows them to adapt directly from the patterns found by FREQTALS, including the tests afterwards if need be. This is not easily achieved, the two patterns shown in red in Listings 2 and 3. The however, since we want our generated unit tests to match third pattern that we used was one which represents the exactly the same code fragments as the patterns found by usage of return inside of loops. The last one did not the miner and since this matching is non-trivial, amongst correspond to a mined pattern but rather to a recurrent others due to the need to for backtracking.

To keep the tests understandable, we construct them using test blocks. An example of two combined test

3https://docs.python.org/3/library/ast.html#ast.walk

4https://docs.python.org/3/library/ast.html#ast.iter_child_ nodes error that we discovered manually in the students’ code while analysing the dataset, as explained in Section 4.1.

At the time of writing, about 66 students have given this revision exercise at least one try. Of those, 23 students received at least one of the 4 dedicated feedbacks that we added to this exercise. We analysed the submission of those 23 students and saw that the given feedback was useful for at least 6 of them. Indeed, we could observe that, after they took the feedback into account, the students obtained a better score for the question and didn’t match the pattern anymore after they corrected their code.

5. Conclusion

In this paper we explored whether we could identify recurrent patterns highlighting errors or bad practices in students’ code and use them to create more advanced feedback to be included in an autograder. We did so by ifrst mining the solutions of a large amount of students for frequently occurring patterns using the FREQTALS tree mining algorithm. We then manually analysed each of these patterns to find those that match typical mistakes that many students seem to make in their solution. We also found that for some of these patterns, creating advanced feedback when they are detected could help student produce code of better quality.

Our second step was then, for the selected patterns, to automatically extract unit tests that match exactly the same source code as the mined patterns. These unit tests, which would thus check for typical coding errors made by students, could then be extended by a teacher with a more personalized feedback on that particular kind of error. These unit tests can then be included in an autograder so that students making these mistakes get more accurate feedback on the errors they make.

As both the FREQTALS miner and the test generation tool in essence only require an AST to mine for patterns and generate the tests, these tools can easily be applied to other programming languages than Python as well.

In this work in progress paper, we mainly presented our vision and the current implementation of our tool prototype. Although we have started exploring the usage of this tool on real data, for now the purpose of the validation was more to ensure that the tool has potential and is working correctly. Obviously a more thorough and in-depth validation on more data and with real students and teachers is still needed. On the one hand we want to study and understand how teachers use the tool and how the tool can be improved further to satisfy their needs. On the other hand we want to validate that the students efectively benefit from the improved feedback provided by the generated test suites (i.e., better scores and better code quality).

[1]

Derval ,

Gego ,

Reinbold ,

Frantzen , P. Van Roy , Automatic grading of programming exercises in a MOOC using the INGInious platform, European Stakeholder Summit on experiences and best practices in and around MOOCs (EMOOCS'15) ( 2015 ) 86 - 91 .

[2]

H. S.

Pham ,

Nijssen ,

Mens ,

D. D.

Nucci ,

Molderez ,

C. D.

Roover ,

Fabry ,

Zaytsev , Mining patterns in source code using tree mining algorithms , in: International Conference on Discovery Science, Springer, 2019 , pp. 471 - 480 .

[3]

Mens ,

Nijssen , H.-S. Pham, The good, the bad, and the ugly: mining for patterns in student source code , in: Proceedings of the 3rd International Workshop on Education through Advanced Software Engineering and Artificial Intelligence , 2021 , pp. 1 - 8 .

[4]

Keuning ,

Jeuring ,

Heeren , Towards a systematic review of automated feedback generation for programming exercises, Association for Computing Machinery , New York, NY, USA, 2016 .

[5]

Narciss , Feedback Strategies for Interactive Learning Tasks, 2008 , pp. 125 - 144 .

[6]

Haldeman ,

Tjang ,

Babeş-Vroman ,

Bartos ,

Shah ,

Yucht ,

T. D.

Nguyen , Providing meaningful feedback for autograding of programming assignments, Association for Computing Machinery , New York, NY, USA, 2018 .

[7]

Singh ,

Gulwani ,

Solar-Lezama , Automated feedback generation for introductory programming assignments , in: Proceedings of the 34th ACM SIGPLAN conference on Programming language design and implementation , 2013 , pp. 15 - 26 .

[8]

Gulwani ,

Radiček ,

Zuleger , Automated clustering and program repair for introductory programming assignments , ACM SIGPLAN Notices 53 ( 2018 ) 465 - 480 .

[9]

Nguyen ,

Piech ,

Huang ,

Guibas , Codewebs: Scalable homework search for massive open online programming courses , in: Proceedings of the 23rd International Conference on World Wide Web, ACM , 2014 , pp. 491 - 502 . doi: 10 .1145/2566486. 2568023.

[10]

Asai ,

Abe ,

Kawasoe ,

Sakamoto ,

Arimura ,

Arikawa , Eficient substructure discovery from large semi-structured data , IEICE TRANSACTIONS on Information and Systems 87 ( 2004 ) 2754 - 2763 .