Improving Short and Long-term Learning in an Online Homework System

Ben Prystawski, University of Toronto, ben.prystawski@mail.utoronto.ca
Jacob Nogas, University of Toronto, jacob.nogas@mail.utoronto.ca
Andrew Petersen, University of Toronto, andrew.petersen@utoronto.ca
Joseph Jay Williams, University of Toronto, williams@cs.toronto.edu

ABSTRACT
Online homework systems are common in university courses. While scientific findings about learning could have bearing on how instructors design these systems, there is little guidance available for instructors on the problem of extrapolating scientific results in various contexts to make design decisions in specific settings. This paper leverages the value of online environments to conduct randomized experiments that directly test principles in a real-world introductory programming course. We investigate the relative benefit of giving students explanations of the correct solution to a problem and giving them an additional problem. We find suggestive evidence that students do better on subsequent problems in the same exercise when given an explanation, but they do better on a post-test two weeks later when given an additional practice problem. These results can inform instructors' decisions in designing online homework.

Keywords
education; programming; explanation; online homework; experiment; MOOC

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. INTRODUCTION AND RELATED WORK
Many university courses use online homework systems to give students practice material. These systems enable students to conveniently practice their skills and instructors to automatically grade homework and gather data on student performance. They typically consist of problems with either written or multiple-choice responses for students to complete.

While much work in educational data mining has focused on extracting and analyzing data students generate as they naturally interact with these systems, we diverge from that trend in this paper by deliberately embedding a randomized experiment in an online homework environment. We believe that randomized experiments can provide valuable insight for computing education researchers and practitioners because they can directly test the impact of different educational interventions.

Past research has investigated the effects of these supports on student learning. There is experimental evidence, for example, that explanations and practice problems can help student learning under certain circumstances [5, 6, 8]. However, instructors still must solve the considerable problem of how to translate this research to a real-world setting.
There is evidence that providing students with instructional explanations when they are solving problems can benefit learning, which is intuitive. However, these explanations are not always effective, especially when they merely give away the answer rather than help students come to see how to solve a problem [8, 11]. For instance, when learners already have some knowledge about a subject, providing additional explanations instead of other knowledge-reinforcing activities can be detrimental to learning [12]. There has also been considerable research on prompting students to write their own explanations in a laboratory setting, finding that having students write their own explanations of key course concepts can help learning [10, 4].

Similarly, the effects of solving problems on learning can be varied and complex under different circumstances. For instance, there is a large body of research on the design of intelligent tutoring systems, which automatically determine which practice problems to show learners and in what order to improve their understanding most effectively [1]. However, mathematics and computer science education research point to the challenges in assuming additional practice problems are always helpful, as sometimes they are a poor use of students' time or lead students to focus on procedural knowledge instead of understanding the underlying principles [7, 9].

These studies have shown that even in a controlled laboratory environment, the effects of these intuitively helpful interventions vary significantly. Instructors seeking to apply these findings to their courses face the problem of translating findings from laboratory experiments to design decisions about what kind of support to provide in online problems and other educational environments.

The ICAP framework [2, 3] provides a theoretical basis for thinking about different educational methods by grouping them into levels based on the depth of students' engagement with the material. The levels are, from most to least engaging: interactive, constructive, active, and passive methods. Using this framework, one might expect the active learning approach of solving problems to be more effective for students' understanding than the passive approach of reading more explanations. Students must engage with practice problems at a deeper level than explanations, so practice problems might produce better learning outcomes. Comparing the effectiveness of these two methods enables us to test the active-passive boundary within the ICAP framework.

While the ICAP framework might lead one to predict that an additional problem should be more helpful to learning than an explanation, this could be confounded by the fact that the additional problem is optional. If students spend enough time thinking about and attempting the problem, it should improve their understanding beyond the improvement they would see from reading an explanation of the solution. However, it is also possible that students will dedicate less time and attention to the additional problem than they would to the explanation, as trying to solve a problem is a more daunting task than reading an explanation. Furthermore, there is the variable of time to improvement. Perhaps students will not see any immediate benefit from trying an additional problem, but doing it will help their learning in the longer term by cementing their understanding of the concept the problem tests. Will students benefit from additional homework immediately, or will the improvement affect how well the student remembers that week's material later? Both hypotheses appear plausible.

Likewise, one might expect that additional explanations will not have a significant effect on student learning. They fit into the passive category of the ICAP framework, which is the lowest level of engagement.
The explanation is also optional to read, so students might ignore it entirely. However, one might also expect that reading a well-written explanation of a concept will deepen a student's understanding of the concept they are being tested on. Furthermore, it could be the case that students forget the explanation as soon as they finish working on the exercise, but it could also improve their understanding over a longer period of time, similar to how they remember what they learned from lectures when completing homework.

Many counterintuitive effects have been found in prior education research, so it is essential to empirically study the effects of interventions before recommending them to instructors. In this paper, we extend past literature on the role of reading explanations and solving problems in learning and provide empirical evidence on how these forms of student support affect learning in a real-world setting. Ultimately, we hypothesize that students will perform better on subsequent tests of their understanding when they are shown an additional problem compared to when they are shown an explanation. This hypothesis is motivated by the active-passive boundary from the ICAP framework.

2. METHODS
The context for the experiment on explanations and additional practice problems was the Programming Course Resource System (PCRS) online homework system for the introductory computer programming course at the University of Toronto. This course spans one twelve-week semester, and students are given for-credit online homework exercises each week. The problem we deployed the supports on is shown in Figure 1. It asks students to analyze the runtime of a for loop in Python.

Figure 1: Screenshot of the problem on which the student supports were deployed in the PCRS homework system.

A total of 648 students completed the homework in week 10 of the course. 478 of these students also completed the optional follow-up exercise in week 12. There were 5 problems in each week. This choice of weeks ensures that there is considerable delay between the initial intervention and subsequent measurement, enabling us to measure long-term learning.

In the experiment, after students attempted a homework problem in week 10 of the course, we used a factorial design to independently vary two factors: whether an explanation was provided and whether an additional problem was provided. The experiment was performed in the context of a multiple-choice problem pertaining to run-time analysis, shown in Figure 1. Students were given course credit for completing the main problem, but did not have any direct incentive to read the explanation or attempt the additional problem.

To measure the impact on learning over a longer time frame, we designed a post-test with problems that were either identical to or variants of the problems asked in week 10. We gave these problems to students two weeks after the experiment (week 12). Some follow-up problems were identical to the corresponding week 10 problems, and others had features of the problem changed, such as having a loop executing 50 times rather than 30. Students were not at ceiling performance in the post-test, suggesting that they did not remember the exact answers by week 12, so these were non-trivial measures of learning. Figure 3 shows the names of problems in week 10 and the corresponding problems in the week 12 follow-up activity. All of these problems were focused on analyzing the runtime of Python programs.

Figure 3: Names of problems in week 10 (top: Repeat Chars, First Fifty, Nested For and Range, How Many - Two For Loops, How Many - Nested For Loops) with corresponding problems in week 12 (bottom: Repeat Chars, Repeat Chars v2, First Fifty, Nested For and Range v2, How Many - Nested For Loops). Blue lines indicate correspondence between problems. Problems with the same name are identical, while problems ending in "v2" are very similar but with minor differences such as different variable names. Learning supports were all deployed on the problem "Repeat Chars" in week 10.

2.1 Experimental Factors
We experimentally varied two variables in a factorial experiment. Each time a student submitted an answer to the first problem of the exercise, they were randomly assigned to a condition for the Explanation factor and the Additional Problem factor.

The Explanation factor had three levels: absent (none), short, or long. The short explanation states, "The third answer is correct because the code inside the for loop takes constant steps regardless of len(s) and it will be executed len(s) times." The long explanation states, "Suppose s = 'cat'. Then, double = double + ch * 2 will be executed 3 times because the for loop iterate through each character of s (i.e. 'c', 'a' and 't'). Now, suppose s = 'google'. Then double = double + ch * 2 will be executed 6 times. As you can see, if len(s) doubles, the number of steps also doubles. So, the third answer is correct."
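For concreteness, the kind of loop these explanations describe can be sketched as follows. This is a reconstruction based only on the explanation text quoted above; the exact code shown to students in Figure 1 is not reproduced here, and the function name and scaffolding are illustrative assumptions.

    # Illustrative reconstruction of the kind of loop the explanations refer to.
    # The function name and surrounding scaffolding are assumptions; the
    # problem's actual code appears only in the Figure 1 screenshot.
    def repeat_chars(s):
        double = ''
        for ch in s:                      # runs len(s) times
            double = double + ch * 2      # constant number of steps per iteration
        return double

    # The loop body executes once per character: 3 times for 'cat', 6 times
    # for 'google', so the number of steps grows linearly with len(s).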
The Additional Problem factor had two levels: absent (none) or present (one additional problem that was very similar to the problem students had attempted, in that it asked them to trace through a for loop and determine its time complexity). A screenshot of this problem is shown in Figure 2.[1]

Figure 2: An additional problem given to some students as a follow-up exercise in the online homework system.

[1] These factors were varied in the context of a larger experiment with more factors that will not be described in this paper in the interest of space. We used weighted randomization in favour of not showing students additional activities to avoid overwhelming them with too many activities.

To measure how well a student performs on a problem, we used the number of attempts until the first correct answer. This is simply the number of submissions made before the student selects all of the correct options and none of the incorrect options on the multiple-choice problem. For example, if a student gets the problem correct on their first try, their number of attempts is 1. If they get the first attempt wrong but the second attempt right, their number of attempts is 2.

To measure short-term improvement, we took the difference between the number of attempts on problem 1 of the week 10 exercise and the average number of attempts for the remaining four problems in that exercise. This number can be negative if students did worse on the remaining problems than they did on the problem we deployed the supports on, and the higher the number, the greater the improvement.

To measure improvement on the delayed exercise, we took the difference between the number of attempts on problem 1 of the week 10 exercise and the number of attempts on the exact same problem when presented in the week 12 follow-up.
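As a minimal sketch of how these two measures could be computed from per-student attempt counts (the data layout and function names are assumptions for illustration, not the study's actual analysis code):

    # Hypothetical sketch of the two improvement measures described above.
    def short_term_improvement(week10_attempts):
        """week10_attempts: attempts on the five week-10 problems,
        with the supported problem (problem 1) first."""
        first, rest = week10_attempts[0], week10_attempts[1:]
        # Positive values mean fewer attempts on the remaining problems.
        return first - sum(rest) / len(rest)

    def long_term_improvement(week10_problem1_attempts, week12_same_problem_attempts):
        """Difference in attempts between problem 1 in week 10 and the
        identical problem in the week 12 follow-up."""
        return week10_problem1_attempts - week12_same_problem_attempts

    # Example: 3 attempts on problem 1, then 2, 1, 2 and 1 on the rest -> 1.5
    print(short_term_improvement([3, 2, 1, 2, 1]))
    print(long_term_improvement(3, 2))  # -> 1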
3. RESULTS AND DISCUSSION
In this section, we first show a lack of evidence for an improvement in performance between the problem we added student support to in week 10 and the same problem given in a follow-up exercise in week 12. Next, we analyze the effects of the Explanation and Additional Problem factors. Our results did not reach the significance threshold of p < 0.05, and as such they should be interpreted with caution. We present suggestive evidence that the explanations were helpful on the same homework exercise (t(490)=-1.24, p=0.215), but not on the follow-up test two weeks later (t(364)=0.300, p=0.764). Finally, we show the reverse trend with the Additional Problem: it was not helpful in the same homework exercise (t(646)=0.158, p=0.874) but might have been in the follow-up exercise two weeks later (t(476)=-1.602, p=0.110). We interpret how these results can inform instructors' design choices and address possible limitations of the work.

3.1 Minor improvement on the same problem
Students took only slightly fewer attempts to get the problem correct in week 12 compared to week 10. While they took 2.07 attempts on average to get the answer right in week 10, they took 2.02 attempts to get the answer right in week 12, even though they had completed the same problem just two weeks before. We found little evidence that students improved between solving a problem in week 10 and solving the same problem in week 12 (t(1136)=-0.529, p=0.597). This suggests that students might not have remembered the answer to the problem even when they had already solved it two weeks earlier, meaning testing them on the same problem in week 12 appears to be a non-trivial measure of their understanding of the same concepts.

3.2 Explanations might have helped in the short term
We found no statistically significant difference between students who were given explanations and those who were not. However, the results suggest that when students were given explanations, they took slightly fewer attempts to get the right answer in subsequent problems than those who did not, regardless of whether they saw a short (t(490)=-1.24, p=0.215) or long explanation (t(491)=-1.29, p=0.195), as shown in Figure 4. However, the effect of seeing explanations was much smaller in the long term, as the sample means were similar in all three conditions. This is shown in Figure 5 and suggests that an explanation in a homework context might be useful only during that homework session. This could have happened because the problems tested a procedural skill, namely runtime analysis. While reading an explanation gives students a clear formula they can apply in subsequent runtime analysis, they might forget that formula when they stop working on their homework and lose the benefit of the explanation.

Figure 4: Effect of giving an explanation on number of attempts to get the right answer on subsequent problems in the same exercise (short-term difference in # attempts; none: mean=-0.305, n=337; short: mean=-0.098, n=155; long: mean=-0.095, n=156). T-test results: none vs. short: (t(490)=-1.24, p=0.215), none vs. long: (t(491)=-1.29, p=0.195), short vs. long: (t(309)=-0.0263, p=0.979).

Figure 5: Effect of giving an explanation on number of attempts to get the right answer on the same problem given two weeks later (long-term difference in # attempts; none: mean=0.034, n=262; short: mean=-0.029, n=104; long: mean=-0.062, n=112). T-test results: none vs. short: (t(364)=0.300, p=0.764), none vs. long: (t(372)=0.433, p=0.665), short vs. long: (t(214)=0.126, p=0.900).
3.3 Additional Problems might have helped in the long term
As with the Explanation factor, we did not find statistically significant evidence for a difference in means for the Additional Problem factor. However, we still found suggestive evidence that giving an additional problem has an effect in the long term. We did not find evidence for a difference between the performance of students who were shown an additional problem and those who were not on subsequent problems in the same homework exercise, but students who received the additional problem took fewer attempts in the post-test (t(476)=-1.602, p=0.110). These results are shown in Figures 6 and 7 respectively.[2] This difference might suggest that the value of the additional practice problem was primarily as a memory aid. Doing more problems could have helped students remember the skill they learned better when writing the post-test. If this knowledge is already in their minds when they are doing the exercise, it makes sense that they did not benefit immediately from more practice. However, they might remember more when writing the post-test, which would explain the improvement in performance there.

Figure 6: Effect of giving an additional problem on number of attempts to get the right answer on subsequent problems in the same exercise (short-term difference in # attempts; none: mean=-0.199, n=486; additional problem: mean=-0.222, n=162). T-test result: (t(646)=0.158, p=0.874).

Figure 7: Effect of giving an additional problem on number of attempts to get the right answer on the same problem given two weeks later (long-term difference in # attempts; none: mean=-0.079, n=367; additional problem: mean=0.252, n=111). T-test result: (t(476)=-1.602, p=0.110).

[2] After this analysis, we noticed that the control and experimental groups had different variances, which violates the assumption of the standard t-test. We then ran Welch's t-test and found a p-value of 0.07 (t(476)=-1.813, p=0.0712).
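As a sketch of the comparison described in the footnote: the standard two-sample t-test assumes equal variances in both groups, whereas Welch's t-test does not. With SciPy the two differ by a single argument; the score lists below are made-up placeholders rather than the study's data.

    # Hedged sketch of running the standard t-test vs. Welch's t-test.
    # The improvement scores below are placeholder values, not real data.
    from scipy import stats

    none_scores = [0.0, -0.5, 1.0, 0.5, -1.0]   # students shown no additional problem
    extra_scores = [1.0, 0.5, 0.0, 1.5, -0.5]   # students shown the additional problem

    # Student's t-test: assumes both groups share the same variance.
    t_std, p_std = stats.ttest_ind(none_scores, extra_scores)

    # Welch's t-test: drops the equal-variance assumption.
    t_welch, p_welch = stats.ttest_ind(none_scores, extra_scores, equal_var=False)

    print(t_std, p_std, t_welch, p_welch)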
3.4 Limitations
A notable limitation of this work was the lack of statistical significance. However, the results are consistent with each other and align with ideas from the ICAP framework. As such, they suggest a trend that could inform future research. In the interest of open and replicable science, it can be valuable to publish suggestive and negative results that do not meet the threshold for statistical significance. Real-world data is often messy, and suggestive results can reveal crucial new directions for analysis.

Another limitation of this work is that the problems in the week 12 follow-up were not all identical to those in the week 10 homework. They tested the same concepts and some were exact copies, but others were slight modifications of problems on the original homework. Therefore, the observed results might be due to the supports having different degrees of relevance to the problems in week 10 and week 12 rather than the duration between support and post-test. We have mitigated this by using the differences between the number of attempts on the problem we applied the explanations and additional problem to and the relevant subsequent problems as dependent variables: if an intervention improves students' scores on problem 1 in both week 10 and week 12, the changes to both scores cancel out when the difference is computed.

One might also raise the concern that we had different sample sizes in different conditions. More students were assigned to the "none" condition than the other conditions for both the Explanation and Additional Problem factors. We intentionally weighted the randomization in this way to minimize the burden on students from having too many additional activities, a strategy also used in randomized clinical trials in the medical field.
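A minimal sketch of this kind of weighted assignment, assuming hypothetical weights (the actual proportions used in the study are not stated):

    import random

    # Hypothetical sketch of weighted random assignment to conditions.
    # The weights below are illustrative, not the study's actual values;
    # they simply bias assignment toward the "none" conditions.
    def assign_conditions(rng=random):
        explanation = rng.choices(["none", "short", "long"], weights=[2, 1, 1])[0]
        additional = rng.choices(["none", "additional problem"], weights=[3, 1])[0]
        return explanation, additional

    # One assignment is drawn each time a student submits an answer to the
    # first problem of the exercise.
    print(assign_conditions())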
Considering that the effects of reading explanations and solving problems might vary widely with context, such as the week of a course in which supports were given, it is unclear how broadly the trends we identify in our data apply. While it appears possible that giving students more practice problems helps them develop lasting procedural knowledge of how to analyze the runtime of an algorithm, it is not clear that we can conclude the same about different tasks in computer science education, such as learning the syntax of a programming language or how to design an algorithm.

Finally, the week 12 post-test was optional, so dropout is a concern. While 648 students completed the exercise in week 10, only 478 completed the follow-up post-test in week 12. Therefore, the conclusions we draw about the effects of educational supports in the long term might reflect only the population of students who choose to complete the post-test. Though this was the majority of students, the reported effect could be different if, for instance, the students who are unlikely to do optional homework problems in week 12 are also unlikely to attempt an optional problem given to them in their week 10 homework exercise.

4. CONCLUSION AND FUTURE WORK
Our experiment investigated the effects of explanations and additional problems on performance both on a post-test and on subsequent problems in the same exercise. We found intriguing but not definitive insight into the effects of explanations. The mean number of attempts for students who saw either a short or long explanation was lower than for those who saw no explanation, but this difference was not statistically significant (t(490)=-1.24, p=0.215). It is possible that the explanations we showed students simply did not have an effect on their learning in either the short or long term. It could be that the explanations used in this experiment did not benefit students as much as they could have, and effort should be directed to designing better explanations. Alternatively, it is possible that the explanations helped students somewhat on the remaining problems in the homework exercise. If this result were replicated in a larger study, it would be interesting because it could guide instructors in deciding how to effectively incorporate instructional explanations into their courses.

In exploring the effect of additional problems, we found that the mean number of attempts on the equivalent post-test problem was lower for students who were shown an additional problem than for those who were not. This difference was not statistically significant, though we found stronger evidence for it than we found for explanations (t(476)=-1.602, p=0.110). As with the explanations, it is possible that the additional problem we gave students was truly not effective, and future work should focus on how to design more effective practice problems. However, if the long-term improvement as a result of the additional problem is replicated in subsequent large-scale experiments, it could provide guidance for instructors in deciding how to incorporate practice problems into their courses effectively.

If the results reported above reflect a real effect, they suggest that explanations are helpful in the short term, but not in the long term. Conversely, additional problems are helpful in the long term, but not in the short term. This aligns with what one might expect based on the ICAP framework, as solving a problem qualifies as deeper engagement with the learning material than reading an explanation. Instructors likely care more about whether their students retain information in the long term than about whether they understand concepts immediately, so focusing on the long-term learning measure makes sense.

The possible difference between the effects of these interventions is interesting and motivates further research into how the immediate and delayed effects of reading explanations and solving problems might differ. This might help guide instructors in thinking about the trade-offs involved in deciding when to give explanations to students and when to give them more problems.

Future work should investigate how generally this pattern holds. The part of the course we deployed these supports on focused on the procedural skill of reading an algorithm and analyzing its time complexity. Would giving an additional problem still be effective in teaching a different concept in the course, such as the difference between for and while loops? Perhaps additional problems are more helpful in developing procedural knowledge, while good explanations might be more effective in building propositional knowledge. By running similar experiments at different points in the introductory computer science course, we hope to learn more about which types of student support are helpful in developing different skills.

Additionally, we are interested in investigating whether the effects of these interventions differ across subgroups of students. One reason why we might not see a large average effect is that the effectiveness of different forms of support could vary significantly across students. Perhaps, for example, students who take a programming course out of intrinsic interest are more likely to benefit from an additional practice problem than those who take it to satisfy a breadth requirement. By analyzing this experimental data jointly with contextual variables derived from surveys and data mining, we hope to provide a richer picture of which forms of support work for which students and how instructors can tailor interventions more precisely to individual students' needs.
5. REFERENCES
[1] C. J. Butz, S. Hua, and R. B. Maguire. A web-based Bayesian intelligent tutoring system for computer programming. Web Intelligence and Agent Systems: An International Journal, 4(1):77–97, 2006.
[2] M. T. Chi. Active-constructive-interactive: A conceptual framework for differentiating learning activities. Topics in Cognitive Science, 1(1):73–105, 2009.
[3] M. T. Chi and R. Wylie. The ICAP framework: Linking cognitive engagement to active learning outcomes. Educational Psychologist, 49(4):219–243, 2014.
[4] J. L. Chiu and M. T. Chi. Supporting self-explanation in the classroom. In Applying Science of Learning in Education: Infusing Psychological Science into the Curriculum, pages 91–103, 2014.
[5] M. Feng, N. T. Heffernan, and J. E. Beck. Using learning decomposition to analyze instructional effectiveness in the ASSISTment system. In AIED, 2009.
[6] R. Hosseini, T. Sirkiä, J. Guerra, P. Brusilovsky, and L. Malmi. Animated examples as practice content in a Java programming course. In Proceedings of the 47th ACM Technical Symposium on Computing Science Education, SIGCSE '16, pages 540–545, New York, NY, USA, 2016. Association for Computing Machinery.
[7] J. Kay, M. Barg, A. Fekete, T. Greening, O. Hollands, J. H. Kingston, and K. Crawford. Problem-based learning for foundation computer science courses. Computer Science Education, 10(2):109–128, 2000.
[8] D. S. McNamara, T. O'Riley, and R. S. Taylor. Classroom based reading strategy training: Self-explanation vs. a reading control. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 28, 2006.
[9] N. Rummel, M. Mavrikis, M. Wiedmann, K. Loibl, C. Mazziotti, W. Holmes, and A. Hansen. Combining exploratory learning with structured practice to foster conceptual and procedural fractions knowledge. Singapore: International Society of the Learning Sciences, 2016.
[10] J. J. Williams and T. Lombrozo. The role of explanation in discovery and generalization: Evidence from category learning. Cognitive Science, 34(5):776–806, 2010.
[11] J. Wittwer, M. Nückles, and A. Renkl. Improving human tutoring by improving tutor-generated explanations. In Avoiding Simplicity, Confronting Complexity, pages 359–368. Brill Sense, 2006.
[12] J. Wittwer and A. Renkl. Why instructional explanations often do not work: A framework for understanding the effectiveness of instructional explanations. Educational Psychologist, 43(1):49–64, 2008.