<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Extracting Unit Tests from Patterns Mined in Student Code to Provide Improved Feedback in Autograders</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Julien Lienard</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kim Mens</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Siegfried Nijssen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ICTEAM</institution>
          ,
          <addr-line>UCLouvain</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>CS1 courses with large student numbers commonly use autograders to provide students automated feedback on basic programming exercises. Programming such feedback to integrate it into an autograder is a non-trivial and time-consuming task for teachers. Furthermore, such feedback is often based only on expected outputs for a given input, or on the teacher's perception of errors that students may make, rather than on the errors they actually make. We present an early implementation of a tool prototype and supporting methodology to address these problems. After mining the source code of earlier students' responses to exercises for frequent structural patterns, and classifying the found patterns according to these students' scores, our tool automatically generates unit tests that correspond to bad practices, errors or code smells observed in students' submissions. These unit tests can then be used or adapted by a teacher to integrate them into an autograder, in order to provide feedback of higher quality to future generations of students.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;CS education</kwd>
        <kwd>pattern mining</kwd>
        <kwd>unit testing</kwd>
        <kwd>code generation</kwd>
        <kwd>autograders</kwd>
        <kwd>automated feedback</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>Automated graders [1] are often used in the context of</title>
        <p>introductory programming courses to assist students by
providing automated feedback on their programming
exercises. This feedback should be of high quality and
as detailed as possible, so that it can help the students to
learn from and correct their errors autonomously.</p>
        <sec id="sec-1-1-1">
          <title>Motivation</title>
        </sec>
      </sec>
      <sec id="sec-1-2">
        <title>Creating such automated feedback to be integrated in an</title>
        <p>autograder is a non-trivial and time-consuming task for
teachers. As a consequence, often they limit the feedback
to what can be checked by comparing the output of
students’ programs on given inputs to the expected outputs.
In this work, we argue that more qualitative feedback
could be derived from structural regularities discovered
in the source code of students’ submissions.</p>
        <p>Part of the challenge is to decide what regularities
are most relevant for this purpose, and secondly how to
transform these regularities into tests that could be run
on students’ submitted source code. Our proposal is to
generate such tests after using a combination of a code
mining algorithm and a manual analysis of the outcome
of this code mining process by teachers. This data mining
algorithm looks for frequent code patterns that are more
representative for bad solutions than good solutions.</p>
        <p>Reviewing earlier submissions can reveal a lot about
how students approach problems and come up with
answers. Teachers may be able to provide better feedback
to students if they are aware of these patterns in typical
good and bad solutions. Students who struggle with a
particular exercise may benefit from seeing positive
instances of ‘good’ solutions as well as from identifying
‘poor’ patterns in their code in order to understand why
it is incorrect.</p>
        <sec id="sec-1-2-1">
          <title>Research Questions</title>
          <p>Our long-term goal is to aid teachers of introductory
programming courses to include relevant feedback in
autograders, in order for students to better understand
and correct their mistakes. The main research questions
of this paper are:
1. Can we discover recurrent error patterns in
students’ code and use them to create more advanced
feedback?
2. Can we automatically generate test suites from
such identified error patterns, so that instances
of these errors can be detected and reported
automatically to students?</p>
        </sec>
      </sec>
      <sec id="sec-1-3">
        <title>To answer the first research question, we use a data</title>
        <p>
          mining algorithm called FREQTALS for finding frequent
tree patterns in students’ source code [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. This technique
was tested on a dataset of student code submissions and
was shown previously to be able to find interesting trends
representative of good and bad coding idioms. [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]
        </p>
        <p>To tackle the second research question, we develop
an automated tool to generate unit tests for the
discovered patterns in previous student submissions. This tool
can help teachers incorporate these unit tests into an methods such as parsing and type checking. Depending
autograder, to provide more comprehensive feedback to on the programming language used and the kind of
feedstudents. We believe that our tool prototype has the po- back given, various methods can be helpful in diferent
tential to enhance instruction efectiveness and improve situations. The need to manage multiple programming
students’ learning outcomes. languages, the complexity of natural language creation,</p>
        <p>
          The focus of the current paper is mainly on the process and the issue of assessing the quality of the generated
of extracting unit tests from selected patterns. The min- feedback were some of the challenges and restrictions
ing algorithm itself was already reported upon before, the authors mentioned.
and extending the unit tests with detailed feedback that Our proposed technique tries to improve the kind of
helps students overcome their errors is mostly a manual feedback given to students on programming exercises by
efort still. helping teachers identify recurrent errors and to
automatically generate unit tests that can detect such errors
Validation in students’ code. We do not yet automatically produce a
description of the feedback in natural language for those
As initial ongoing validation of our tool prototype we unit tests.
compiled a sizable dataset of Python code taken from an Solution errors occur in programs when they do not
perinstance of the INGInious [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] platform, an autograder form as expected. These errors can manifest themselves
that enables students to submit and receive feedback on in the form of logic errors, when a program does not
programming exercises. The dataset corresponded to a correctly accomplish a desired task. Our technique aims
ifnal exam for an entry-level programming course with to transform detected knowledge about mistakes into
au569 students. We applied the FREQTALS code mining al- tomated feedback capturing solution errors to be included
gorithm to this dataset to identify common patterns, and into an autograder.
then generated unit tests to check which of these patterns Narciss [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] argues that feedback is very important
are present in the source code of the student submissions. when it comes to online learning. Having a feedback
A teacher selected and augmented a few of these patterns loop helps students to really understand and complete
to give more advanced feedback to students, based on the their exercises successfully. She also showed that
oferdiscovered structure in their submitted code. The gen- ing answer-until-correct (AUC) exercises with feedback
erated unit tests for these patterns together with their is generally more efective than feedback on ‘one-shot’
additional feedback were then added to the autograder, exercises. AUC feedback alone may not be suficient for
to be used by students to prepare for future exam ses- some learning activities however, according to her study,
sions. A more thorough analysis of how students use and which suggests that it should be coupled with rich
elaboappreciate this feedback will be the focus of a follow-up rated feedback. AUC feedback is advantageous because it
paper. provides students with numerous chances to apply what
they have learned and fix their mistakes, improving their
2. Related Work memory of the material. In this work, we try to make
feedback more rich by making it more personalised, by
comparing the student’s submission to recurring errors
that have been detected in submissions by earlier
generations of students.
        </p>
      </sec>
      <sec id="sec-1-4">
        <title>The development of automated methods to generate feedback on programming exercises is an important field when it comes to helping students who learn how to program.</title>
        <sec id="sec-1-4-1">
          <title>Importance of feedback</title>
          <p>
            A systematic literature review on automated feedback
for programming exercises was conducted by Keuning et
al. [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ]. Their survey analysed and categorised 69 papers
according to 4 factors:
1. Kind of feedback;
2. Technique used for the feedback;
3. Adaptability by the teachers;
4. Quality of the feedback.
          </p>
          <p>According to this survey, the majority of current
methods to generate automatic feedback for programming
exercises look for coding mistakes by using static analysis</p>
        </sec>
        <sec id="sec-1-4-2">
          <title>Feedback generation</title>
          <p>
            The Concepts and Skills based Feedback Generation
Framework (CSF2) was proposed by Haldeman et al. [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ] for
designing programming assignments, analyzing student
submissions, and generating hints for assisting students
in correcting their errors. In this method, the set of
concepts and skills, called knowledge map, that students are
required to master to complete the assignment is taken
into consideration while creating the task, creating the
test suite, gathering and organizing submissions, and
reifning the test suite and knowledge map. Errors are then
mapped to concepts and skills into what they call a bucket.
          </p>
          <p>
            New tests are then created by hand to assess the common
errors found in each bucket. The described framework
presents a useful structure for improving classroom in- Other sources of inspiration
struction and identifying students’ errors and
misconceptions. The framework is described as a sequence of steps. As stated in the introduction, our paper can be regarded
Our proposed approach is complementary in the sense as an extension the experiment described by Mens et
that it could automate some of these steps, such as the al. [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ]. They used pattern mining on students’ code
subcreation of test suites. missions to detect recurring errors made by students
          </p>
          <p>
            One of the state of the art tools to create personal- participating in a first-year programming course. Our
ized feedback is AutoGrader [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ]. Teachers first provide a main contribution is the addition of test generation for
reference solution and a list of possible mistakes before matching the discovered patterns, in order to apply it to
students can use AutoGrader. This information is then new students’ code submissions. The goal of this
extenused by AutoGrader to generate potential code modifica- sion is to, after having discovered frequently occurring
tions that students could apply to reach the right answer. mistakes using the technique described by Mens et al.,
The authors start from the hypothesis that student mis- generate tests that can make students aware of such
mistakes can be predicted by teachers. In our approach, takes and as such help them correct and avoid them in the
we do not make that assumption and instead attempt to future. We created a first tool prototype, used it to create
identify frequent mistakes made by students by studying a test task for students using this generation technique,
assignments submitted by prior student generations. and are currently analysing initial results to see if the
          </p>
          <p>
            Another interesting automated tool, closer to what we feedback received helps students correct their mistakes.
propose, is CARLA [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ]. It uses existing correct student
solutions in order to try and repair the code of students 3. Methodology
who made mistakes. In order to do so, it first groups
accurate student answers using the concept of dynamic Figure 1 summarises the diferent steps of our approach
equivalence. The clustering, which serves as basis for and tools to help teachers integrate more advanced
feedthe repair process, groups together similar correct solu- back for students in an autograder, based on patterns
tions. The repair algorithm then uses expressions from observed in previous students’ submissions. The first
several correct answers that are located in a same cluster step consist of gathering the source code of these prior
to build minimum corrections for an incorrect student at- submissions (cf. 3.1). The second step extracts abstract
tempt. The strategy reduces the adjustments required to syntax trees from this source code to be given as input to
make a student reach a proper solution by using the most the tree mining algorithm, which then finds frequently
comparable accurate solutions, written by other students. occurring patterns in those ASTs (cf. 3.2). From the list of
Our technique instead uses all students’ code (not only discovered patterns, in step 3 teachers select the most
incorrect solutions) and, rather than using dynamic equiv- teresting ones that they would like to convert into python
alence, uses frequent tree pattern mining to group them. unit tests (cf. 3.3). They can then adapt and integrate
From those groups we create tests that correspond to the these generated unit tests into the autograder to provide
patterns found and let teachers pick the most interesting additional feedback to the students based on the
strucones and add meaningful feedback to those tests. Our ture of their code to help them understand and correct
tool thus doesn’t focus on code repair but rather on the their mistakes (cf. 3.4).
automatic creation of unit tests.
          </p>
          <p>
            Our work is comparable to a tool called Codewebs 3.1. Data gathering
proposed by Nguyen et al. [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ]. Their strategy uses the
solutions of prior submissions to create customized feed- To gather a suficiently large data sample, we had access
back for large MOOCs. They created a queryable index to the source code submitted by students on small
profor quick searches into a sizable dataset of student as- gramming exercises and exam questions in the context of
signment entries by breaking down online homework an entry-level programming course at university. They
submissions into a vocabulary of “code phrases”. They submitted their code on and received feedback from the
argue that by force-multiplying teacher efort, tools like INGInious platform 1, an open source and online teaching
Codewebs can help improve the quality of free education. assistant used by several universities around the world.
They use shared structure among submissions to create For the experiment described in this paper, we collected
specific feedback on student errors. Both our work and the code submitted by students in response to an exam
Codewebs use the data from student submissions from question, which asked students to write a program that
a MOOC, but our approach uses a diferent method to computes the prime factors of a given number.
identify shared structure in student submissions. The following signature and specification of the
function they had to write was given to the students:
          </p>
          <p>Py</p>
          <p>XML</p>
          <p>Py</p>
          <p>
            Feedback
1–8
1. Data gathering
2.a Extract AST
2.b Mine Pattern in AST
3. Test Generation 4. Feedback Creation
d e f f a c t o r s ( n ) : represents a part of an AST. FREQTALS is based on the
" " " FREQT algorithm [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ] and identifies frequent patterns
P r e : n a s t r i c t l y p o s i t i v e i n t e g e r in a dataset by iteratively generating candidate sets of
P o s t : r e t u r n s i n t e g e r r e p r e s e n t i n g patterns and pruning those that do not meet a minimum
number o f p r i m e f a c t o r s o f n support threshold. FREQTALS requires the ASTs it takes
" " " as input to be in a particular XML tree format. Our tool
# c o d e t o c o m p l e t e makes sure this format is followed, by using a Python
parsing library 2 to convert the Python programs written
together with an example of what results the function by students into the format required by the miner. Code
should return for some given cases: that could not be parsed was not considered for the test
f a c t o r s ( 3 3 ) # r e t u r n s 2 ( 3 3 = 3 x 1 1 ) generation.
f a c t o r s ( 1 2 ) # r e t u r n s 3 ( 1 2 = 2 x 2 x 3 ) FREQTALS can be configured with various constraints,
f a c t o r s ( 8 ) # r e t u r n s 3 ( 8 = 2 x 2 x 2 ) such as the minimum number of files in which the pattern
f a c t o r s ( 1 2 7 ) # r e t u r n s 1 ( 1 2 7 p r i m e ) can be found, the minimum or maximum size of the
patterns, and the list of allowed root nodes (i.e. what
          </p>
          <p>The exam context guarantees that students responded types of syntactic constructs, such as Python function
to the questions individually and under similar condi- bodies, the miner should focus on). All these constraints
tions, on university computers in a classroom setting. were configured as described in the original paper.
Furthermore, whereas students received automated feed- As the ASTs of the students’ code are tree-like
repback throughout the semester and when preparing for the resentations, FREQTALS searches for patterns in tree
exam, during the exam itself the autograder’s feedback structures. It takes into account the structural
characteroption was turned of. This implies that the patterns istics of the patterns, such as the placement of the child
which we discovered after analysing the student’s re- nodes and the frequency with which particular labels
sponses to the exam questions were not biased by any appear in the tree. The pattern miner discovers diferent
feedback they received on those questions. types of patterns, including control flow patterns, data</p>
          <p>
            As in the mining experiment described by Mens et structure patterns, and algorithm pattern types.
Comal. [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ], we separated the dataset in 2 groups: students mon programming constructs like loops, conditionals
who obtained a score of 50% or more on the question and and function calls can be included in these patterns, as
those who obtained a score of less than 50%. This score well as more specialized language constructs like list
was calculated automatically by running a set of unit tests comprehension or generator expressions.
that checked diferent input-output pairs that a correct We used the FREQTALS tree miner to find common
solution should respect. Taking these two subsets as patterns in the ASTs of the programs submitted by both
input, the mining algorithm would try and find patterns the group of students who obtained 50% or more and by
that are more representative for one set than for the other. the group who scored less than 50%, in order to assess the
eficacy of our tool. Whereas we kept the same
configura3.2. Pattern mining tion settings as in the original paper, we did experiment
with diferent values to set the threshold for a pattern
to be considered as ‘frequent’. We experimentally
deter
          </p>
        </sec>
      </sec>
      <sec id="sec-1-5">
        <title>FREQTALS [2] is a tree mining algorithm designed to</title>
        <p>identify frequent patterns in the abstract syntax trees
(ASTs) of programs. Each pattern is a tree structure that</p>
      </sec>
      <sec id="sec-1-6">
        <title>2https://docs.python.org/3/library/ast.html#ast.parse</title>
        <p>Listing 1: Example of an augmented unit test written using our technique.</p>
        <p>The pattern miner’s output is an XML file containing the
tree structure of all discovered patterns. We transform
this output into a set of Python unit tests. We want each
resulting unit test to check whether a code submission
matches a pattern found by the mining algorithm. To
do so, for a given pattern, our unit test generation tool
traverses its structure. For each node of the pattern’s 4. Initial Experiment
tree, the tool generates a Python code block that searches
for corresponding AST nodes in the code to be checked. In this section we will describe the results of an initial
If the pattern node has children, the tool searches the experiment we conducted to assess the pattern mining
children of the found AST node for matches in its children. tool and a first prototype implementation of our unit test
This search is repeated until all nodes in the pattern’s generation tool.</p>
        <p>XML structure have been checked, but if no match is
discovered, the tool backtracks to the next matching AST 4.1. Patterns
node. A fragment of such a generated unit test can be
found in Listing 4 in Subsection 4.2.</p>
      </sec>
      <sec id="sec-1-7">
        <title>For the exam question mentioned in subsection 3.1, we</title>
        <p>gathered the code submitted by a total of 569 students.</p>
        <p>Of these, 560 were parsed successfully and included in
3.4. Feedback creation our analysis. Most of the code that didn’t parse was
Our longer-term research goal is to assist teachers in unfinished code or code containing syntax errors. The
providing students with more insightful feedback on their code was then split into two groups: 149 submissions for
exercises. The generated tests are to be used to evaluate which the students obtained a score of less than 50%, and
students’ code automatically and to be augmented with 411 submissions with a score of 50% or higher. (These
more specific feedback on the errors they make. scores were calculated automatically with an autograder,</p>
        <p>After having run the pattern miner on previous stu- after completion of the exam, using a set of graded unit
dents’ submissions to existing assignments, teachers can tests that were programmed by the teacher beforehand.)
select any pattern that catches their attention, and gener- Next, playing the role of a teacher, we analysed all 147
ate a Python unit test for it. They can then add additional patterns mined by the FREQTALS algorithm to identify
feedback to that unit test and include it into the auto- which of them could be interesting or potentially useful
grader for that assignment, or they can use it to design a to transform into unit tests. This analysis included a
new assignment around that pattern. review of the number of times each pattern appeared</p>
        <p>The generated test will fail and reveal problems in stu- in the code of the high-scoring and low-scoring groups,
dents’ code if it does not adhere to the pattern. Students as well as a manual review of the patterns themselves
can then rely on this augmented feedback provided by to identify if they captured a common mistake made by
the teacher in order to help them fix their mistakes (see several students. Other criteria we used in our selection
Listing 1). Compared to just receiving a mark or general were whether the pattern’s absence or presence afected
the code’s quality, the total number of occurrences of the
&lt;Assign&gt;
&lt;__directives&gt;</p>
        <p>&lt;match-sequence/&gt;
&lt;/__directives&gt;
&lt;targets&gt;
&lt;Name&gt;
&lt;identifier&gt;
&lt;Dummy&gt;</p>
        <p>...</p>
        <p>&lt;/Dummy&gt;
&lt;/identifier&gt;
&lt;/Name&gt;
&lt;/targets&gt;
&lt;value&gt;
&lt;List&gt;
&lt;elts&gt;
&lt;__directives&gt;</p>
        <p>&lt;match-sequence/&gt;
&lt;/__directives&gt;
&lt;Constant&gt;
&lt;identifier&gt;
&lt;Dummy&gt;</p>
        <p>...</p>
        <p>&lt;/Dummy&gt;
&lt;/identifier&gt;
&lt;/Constant&gt;
...</p>
        <p>&lt;/elts&gt;
&lt;/List&gt;
&lt;/value&gt;
&lt;/Assign&gt;</p>
        <p>targets
Assign
value</p>
        <p>Name
List elts</p>
        <p>Constant
Constant
Constant
Constant
Constant
Constant</p>
        <p>Listing 2: Example of a student’s bad practice: hard
cod</p>
        <p>
          ing the list of prime numbers to check.
d e f f a c t o r s ( n ) :
l = [
          <xref ref-type="bibr" rid="ref2 ref3 ref5 ref7">2,3,5,7,11,13,17,19</xref>
          ] # l i s t o f p r i m e s
i = 0
t = [ ]
f o r i i n range ( l e n ( l ) ) :
i f n% l [ i ] == 0 :
t . a p p e n d ( l [ i ] )
i −= 1
n b r = l e n ( t )
r e t u r n ( n b r )
Listing 3: Example of a student’s code close to the correct
        </p>
        <p>solution, but confounding ‘if’ and ‘while’.
d e f f a c t o r s ( n ) :
c o u n t =0
i f n = = 1 :</p>
        <p>r e t u r n 0
i f n &lt; 0 :</p>
        <p>r e t u r n F a l s e
s =n
for i in range (2,n):
if n%i==0:
s=s/i
p r i n t ( s )
e l s e :</p>
        <p>c o u n t +=1
r e t u r n c o u n t
pattern and the diference between matching code in the
high-scoring and low-scoring groups.</p>
        <p>We discovered several patterns that could reveal
potential bad practices or errors in students’ code. An example
of such a bad practice is illustrated in Listing 2. This
code corresponds to a match of the second pattern that
was found. The corresponding XML pattern and AST
representation are shown in Figure 2.</p>
        <p>The part of the code that matches the pattern is high- found by the miner. One such error is the overuse of
lighted in red in Listing 2. The pattern matches code nested loops. Even if we found some cases where
stufragments where students hard coded the list of prime dents achieved to create correct code using more than
factors to check. We found 6 students’ code that matched 2 nested loops, we did observe that such code becomes
this pattern. A high-level feedback that the teacher may quite hard to read and is to be avoided.
want to add for this pattern is that computing those num- It should also be said that not all patterns that were
bers is probably a more generic solution than hard-coding mined were of interest. Some of them do capture
rethem in a list. curring Python code fragments but do not necessarily</p>
        <p>Listing 3 is an occurrence of another pattern showing provide any information on the student’s code quality.
a student’s code that is close to a good solution. Unfortu- For example, pattern number 60 described a code
fragnately the student should have used a while-statement ment containing an if with a return inside followed
instead of an if-statement at line 9. This pattern was by a return at the end of the function. That by itself
found in the code of 8 students. A teacher may want to doesn’t say anything about whether the code is good
provide as feedback to these students that they may have or bad. It is the teacher’s job to filter, from all patterns
confused an ‘if’ for a ‘while’. discovered by the miner, the most relevant ones worth</p>
        <p>In addition to analysing the discovered patterns them- checking.
selves, the mere fact of having to walk through code
fragments that match certain patterns, sometimes leads
to the discovery of other recurrent errors that were not</p>
        <p>Listing 4: Example of test block generated for the pattern 2 that will search a function definition inside all blocks in the AST
4.2. Test generation blocks is shown in Listing 4. In the first block (lines
2-5), we search for all instances of a ‘body’ node in the
The FREQTALS algorithm identified a total of 147 fre- code. If no ‘body’ nodes are found, the pattern cannot
quent patterns in the code submitted by both groups of be matched (lines 3-4). Then, we iterate through each
students. The patterns can occur in one or both the high- identified ‘body’ node (line 5). The second block is visible
scoring and low-scoring groups. When a same pattern from lines 6 to 9.
occurs in the high-scoring group or in both groups at The first block employs the ast.walk()3
functhe same time, it often means that the pattern is either tion to search for the root node of the pattern
anya part of or close to a good solution (such as the pattern where in the code, while the second block uses
depicted in Listing 3) or an uninteresting pattern. When ast.iter_child_nodes()4 to search only within the
a pattern can only be found in the low-scoring group, it direct child nodes of the currently matched node. A
is often a pattern representing an error or bad practice return statement only appears in the first block, while
(such as the pattern depicted in Listing 2) but sometimes the continue statement on line 8 allows us to fall back to
also a pattern of no interest. For each of the discovered the next matching node. This same structure is followed
patterns, our tool could produce a unit test. by all of the other test blocks as well.</p>
        <p>It should be noted that out of the 147 frequent code pat- We believe that, to some extent, this block structure
terns found, about two thirds were patterns too generic allows us to reach an acceptable trade-of between
readto be useful (for example a pattern that just matches the ability and correctness of the test. We do think that,
presence of two assignment statements in the code). Of once a teacher understands the block structure of the
the remaining patterns, about half were representative generated unit tests, the tests remain suficiently
underof code fragments occurring in good solutions and the standable. Adapting the tests slightly to capture slight
other half were patterns occurring mostly in the bad solu- variations of the pattern is possible, although not trivial.
tions. Some patterns often tend to be quite close to other
patterns as well, so in the end the teacher kept only a
handful of relevant patterns to be transformed into unit 4.3. Feedback creation
tests. For our dataset consisting of student submissions for an</p>
        <p>Apart from this problem of having to wade manually exam question, we created 4 unit tests generated from or
through the many patterns found to retain only a few inspired by the mined patterns. We ofered this question,
relevant ones, we also faced another issue which was with the additional unit tests adorned with the dedicated
to ensure that the Python code generated for unit tests feedback we added, to new students as a revision exercise
remained understandable to teachers. This is desirable to help them prepare for an upcoming exam session. Of
as it allows teachers to understand the pattern just by the 4 unit tests that we added, 3 of them were generated
looking at the Python code. It also allows them to adapt directly from the patterns found by FREQTALS, including
the tests afterwards if need be. This is not easily achieved, the two patterns shown in red in Listings 2 and 3. The
however, since we want our generated unit tests to match third pattern that we used was one which represents the
exactly the same code fragments as the patterns found by usage of return inside of loops. The last one did not
the miner and since this matching is non-trivial, amongst correspond to a mined pattern but rather to a recurrent
others due to the need to for backtracking.</p>
        <p>To keep the tests understandable, we construct them
using test blocks. An example of two combined test</p>
      </sec>
      <sec id="sec-1-8">
        <title>3https://docs.python.org/3/library/ast.html#ast.walk</title>
        <p>4https://docs.python.org/3/library/ast.html#ast.iter_child_
nodes
error that we discovered manually in the students’ code
while analysing the dataset, as explained in Section 4.1.</p>
        <p>At the time of writing, about 66 students have given
this revision exercise at least one try. Of those, 23
students received at least one of the 4 dedicated feedbacks
that we added to this exercise. We analysed the
submission of those 23 students and saw that the given feedback
was useful for at least 6 of them. Indeed, we could
observe that, after they took the feedback into account, the
students obtained a better score for the question and
didn’t match the pattern anymore after they corrected
their code.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>5. Conclusion</title>
      <p>In this paper we explored whether we could identify
recurrent patterns highlighting errors or bad practices in
students’ code and use them to create more advanced
feedback to be included in an autograder. We did so by
ifrst mining the solutions of a large amount of students
for frequently occurring patterns using the FREQTALS
tree mining algorithm. We then manually analysed each
of these patterns to find those that match typical
mistakes that many students seem to make in their solution.
We also found that for some of these patterns, creating
advanced feedback when they are detected could help
student produce code of better quality.</p>
      <p>Our second step was then, for the selected patterns, to
automatically extract unit tests that match exactly the
same source code as the mined patterns. These unit tests,
which would thus check for typical coding errors made
by students, could then be extended by a teacher with
a more personalized feedback on that particular kind
of error. These unit tests can then be included in an
autograder so that students making these mistakes get
more accurate feedback on the errors they make.</p>
      <p>As both the FREQTALS miner and the test generation
tool in essence only require an AST to mine for patterns
and generate the tests, these tools can easily be applied
to other programming languages than Python as well.</p>
      <p>In this work in progress paper, we mainly presented
our vision and the current implementation of our tool
prototype. Although we have started exploring the usage
of this tool on real data, for now the purpose of the
validation was more to ensure that the tool has potential and
is working correctly. Obviously a more thorough and
in-depth validation on more data and with real students
and teachers is still needed. On the one hand we want to
study and understand how teachers use the tool and how
the tool can be improved further to satisfy their needs.
On the other hand we want to validate that the students
efectively benefit from the improved feedback provided
by the generated test suites (i.e., better scores and better
code quality).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Derval</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gego</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Reinbold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Frantzen</surname>
          </string-name>
          ,
          <string-name>
            <surname>P. Van Roy</surname>
          </string-name>
          ,
          <article-title>Automatic grading of programming exercises in a MOOC using the INGInious platform, European Stakeholder Summit on experiences and best practices in and around MOOCs (EMOOCS'15) (</article-title>
          <year>2015</year>
          )
          <fpage>86</fpage>
          -
          <lpage>91</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H. S.</given-names>
            <surname>Pham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nijssen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Mens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. D.</given-names>
            <surname>Nucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Molderez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Roover</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fabry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Zaytsev</surname>
          </string-name>
          ,
          <article-title>Mining patterns in source code using tree mining algorithms</article-title>
          , in: International Conference on Discovery Science, Springer,
          <year>2019</year>
          , pp.
          <fpage>471</fpage>
          -
          <lpage>480</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>Mens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nijssen</surname>
          </string-name>
          , H.-S. Pham,
          <article-title>The good, the bad, and the ugly: mining for patterns in student source code</article-title>
          ,
          <source>in: Proceedings of the 3rd International Workshop on Education through Advanced Software Engineering and Artificial Intelligence</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Keuning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jeuring</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Heeren</surname>
          </string-name>
          ,
          <article-title>Towards a systematic review of automated feedback generation for programming exercises, Association for Computing Machinery</article-title>
          , New York, NY, USA,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Narciss</surname>
          </string-name>
          ,
          <article-title>Feedback Strategies for Interactive Learning Tasks,</article-title>
          <year>2008</year>
          , pp.
          <fpage>125</fpage>
          -
          <lpage>144</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G.</given-names>
            <surname>Haldeman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tjang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Babeş-Vroman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bartos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yucht</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. D.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <article-title>Providing meaningful feedback for autograding of programming assignments, Association for Computing Machinery</article-title>
          , New York, NY, USA,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gulwani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Solar-Lezama</surname>
          </string-name>
          ,
          <article-title>Automated feedback generation for introductory programming assignments</article-title>
          ,
          <source>in: Proceedings of the 34th ACM SIGPLAN conference on Programming language design and implementation</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>15</fpage>
          -
          <lpage>26</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gulwani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Radiček</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zuleger</surname>
          </string-name>
          ,
          <article-title>Automated clustering and program repair for introductory programming assignments</article-title>
          ,
          <source>ACM SIGPLAN Notices</source>
          <volume>53</volume>
          (
          <year>2018</year>
          )
          <fpage>465</fpage>
          -
          <lpage>480</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Piech</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Guibas</surname>
          </string-name>
          , Codewebs:
          <article-title>Scalable homework search for massive open online programming courses</article-title>
          ,
          <source>in: Proceedings of the 23rd International Conference on World Wide Web, ACM</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>491</fpage>
          -
          <lpage>502</lpage>
          . doi:
          <volume>10</volume>
          .1145/2566486. 2568023.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T.</given-names>
            <surname>Asai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Abe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kawasoe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sakamoto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Arimura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Arikawa</surname>
          </string-name>
          ,
          <article-title>Eficient substructure discovery from large semi-structured data</article-title>
          ,
          <source>IEICE TRANSACTIONS on Information and Systems</source>
          <volume>87</volume>
          (
          <year>2004</year>
          )
          <fpage>2754</fpage>
          -
          <lpage>2763</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>