Designing an Intelligent Tutoring System Across Multiple Classes [Extended Abstract]

Laura O. Moraes
Systems and Computing Department (COPPE/PESC)
Universidade Federal do Rio de Janeiro (UFRJ)
lmoraes@cos.ufrj.br

Carlos Eduardo Pedreira
Systems and Computing Department (COPPE/PESC)
Universidade Federal do Rio de Janeiro (UFRJ)
pedreira56@gmail.com

Copyright (c) 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
The ability to understand a person's knowledge is important in educational settings. It can be used to recognize gaps in knowledge and to diagnose misunderstandings and misconceptions. This paper presents an intelligent tutoring system created to gather student knowledge data and the methodology proposed to generate the datasets. We asked 14 professors to determine the concepts involved in a set of problems, and we compare the student behavior observed under each data collection approach.

Keywords
open datasets, intelligent tutoring system, educational data mining, computer science education

1. INTRODUCTION
In this work, we have two main motivations: to create an online learning environment that gathers data about students' knowledge, and to stimulate students to learn the Python language.

It is important to understand why students succeed or fail when taking a course, so that we can improve teaching methods, identify students' needs, and provide personalized education. Smart learning content is defined as visualizations, simulations and web-based environments that provide outputs to students based on the students' input [2]. The adoption of smart learning content in classrooms and in self-learning environments motivates students [1, 7, 11, 12], improves student learning, and decreases student dropout or failure [1, 6, 8, 9], while increasing their self-confidence, especially among female students [9].

Also, Python is a general-purpose language, which means it can be used in a large variety of projects. This can be a great way to stimulate students, since they can work on projects they actually relate to. Python is also user-friendly and, for the past seven years, it has been the fastest-growing major programming language [17], being correlated with trending careers such as DevOps and Data Scientist [20]. However, according to the 2015 review [6], only 11% of the Educational Data Mining and Learning Analytics papers about programming courses reported using Python as the course language.

These factors motivate our objective: the creation, deployment and use of online intelligent systems in Introduction to Programming classes using the Python language as a way to uncover students' difficulties, to understand their knowledge, and to provide timely feedback that keeps students engaged.

The main contributions of this paper are: 1) generating an anonymized open dataset of student interactions in the Python language, with corresponding solutions, and 2) developing the Machine Teaching system (http://www.machineteaching.tech), an ITS with a built-in recommendation engine.

2. DATA ACQUISITION METHODOLOGY
This section presents the web system developed to acquire student data and to interact with students and educators. The system architecture is proposed as an improvement over a common architecture from the Intelligent Tutoring System literature. We also present the methodology used to collect student data.

2.1 System Architecture
To capture students' data, a web system was designed and implemented. Intelligent Tutoring Systems (ITS) are systems designed to assist the tutoring of students at a personalized level. They have been around since the 1960s [22] and received their first formal definition in the 1990s [18]. Nowadays, there are several different ITS covering a broad range of subjects, successfully used by hundreds of thousands of students a year. Ihantola et al. [6] present a table listing such systems and their supported programming languages.

Ihantola et al. [6] also define a common architecture for these systems in their 2015 review. Common front-end features among them are an IDE for students to write, edit and execute code, a submission interface (which can be embedded in the IDE or in a separate part of the system), feedback on students' actions, and visualization schemes for teachers and researchers. The back-end usually supports saving data in some kind of storage, usually a relational database. However, part of this proposal is to build an integrated personalized exercise recommendation engine within the system. Therefore, none of the existing systems could be used without modifications. To avoid legacy code and to have better control over the desired features and the captured student data, we decided to build one from scratch using open-source Python and Javascript libraries. In addition, the classes in which the system was employed use a modularization-based approach to teach imperative programming [4]. Each question requires self-contained modules of code as an answer, which translates in Python to functions. So, the proposed system also had to take this format into consideration to correct students' assignments. Besides these adaptations, it shares most of its functionality with the other systems. Fig. 1 illustrates an abstract view of the system, adapted from Ihantola et al. [6]. The main differences are the built-in recommendation engine, which controls what exercises are shown in the IDE for the student to solve, the way students' assignments are corrected, and the extra collected data, such as the time spent typing and the total time spent solving the question.
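The collected data mentioned above can be pictured as one record per submission, as in the minimal sketch below. This is only an illustration of the kind of per-attempt data the system stores; the field names are assumptions made here, not the published Machine Teaching schema.

    # Illustrative per-submission record; field names are assumptions, not the
    # actual Machine Teaching database schema.
    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class Attempt:
        student_id: str         # anonymized student identifier
        problem_id: str         # exercise being solved
        code: str               # submitted function source
        passed: bool            # True if all unit test cases passed
        typing_seconds: float   # time spent typing the answer
        total_seconds: float    # total time spent solving the question
        submitted_at: datetime  # timestamp of the submission

    record = Attempt("s001", "p042", "def add(a, b):\n    return a + b",
                     True, 35.2, 118.7, datetime.now())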
Figure 1: Abstract view of the system architecture.

The exercises currently available in the system belong to either the Applying or the Creating category of Bloom's Taxonomy [21]. The students are presented with a problem and should write the expected answer in a free-text coding format. For each exercise, a test case function generator was defined to correct the results. The students get feedback every time they submit an answer, and they can see whether they passed or failed each unit test case. If they get all of them correct, the task is considered done and the student may move on to another problem. The system saves a state every time a student submits an answer.
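To make this correction flow concrete, the sketch below shows one way a function-based submission can be run against a set of test cases and graded per case. The grade function, its arguments and the example exercise are hypothetical; the system's actual grader is not reproduced here and would additionally need to sandbox untrusted code.

    # Minimal sketch of unit-test-based correction of a function-based answer.
    # This is an illustration, not the system's actual grader, which would also
    # need to sandbox the student code and enforce time limits.
    def grade(student_code, func_name, test_cases):
        """Run each (args, expected) pair against the student's function."""
        namespace = {}
        exec(student_code, namespace)      # define the submitted function
        func = namespace[func_name]
        results = []
        for args, expected in test_cases:
            try:
                results.append(func(*args) == expected)
            except Exception:              # runtime errors count as failures
                results.append(False)
        return results                     # one pass/fail flag per test case

    submission = "def double(x):\n    return 2 * x"
    print(grade(submission, "double", [((2,), 4), ((5,), 10)]))  # [True, True]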
2.2 Data Acquisition
For the data acquisition process, we used the system to collect data under two approaches: the system was either used a single time for revision purposes (revision dataset) or throughout a whole semester (semester dataset).

The first approach uses 48 CS1 problems crawled from four Python web tutorials: Practice Python [15], Python School [19], Python Programming Exercises [5], and W3Resource [23]. The chosen sources provided the exercise statements along with the exercise code solutions. Students' responses to these exercises were collected in 10 different Introduction to Programming courses during 2 semesters. Students were assigned one of two strategies: either the system showed random problems or they followed a predefined path. The students were introduced to the system at the end of the semester, before their final exams, or at the beginning of the next semester. In both cases, the exercises were intended to act as revision exercises covering all CS1 content.

In our second approach, we accompanied the students throughout a whole semester in four different Introduction to Programming courses (they had the same syllabus, but different professors). Every week, an exercise list concerning the subject given in class was made available in the system. The students had a one-week deadline to finish each list, and their performance on these lists composed part of their final grade. In total, they had to solve 65 problems.

3. RESULTS
This section presents statistics for both datasets, comparing the different behaviors found in them.

3.1 Revision Dataset
In total, there are 3,632 records from 192 students, with an average of 18.4 attempts on 4.4 problems per student. The dataset is also imbalanced: there are 764 (21%) successful attempts and 2,868 (79%) failed attempts. This means that, on average, each student attempts a problem four times before getting all test cases correct on the fifth attempt. Some simple statistics for the dataset are shown in Table 1. Fig. 2 is a histogram showing the distribution of successes and failures per problem. Similar behavior is found in the distribution of success and failure attempts per student. Both success attempt distributions have smaller variance and smaller mean than the corresponding failed attempt distributions. This is expected, since students are given several tries to submit a correct response before moving on to the next problem.

Table 1: Revision dataset statistics
                                   Avg    Median   Min   Max
attempts per question             75.67     55      10   304
attempts per student              18.44     11       2   266
different students per question   18.17     14       4    71
different questions per student    4.43      3       1    44

Figure 2: Distribution of questions' success and fail attempts.

For this dataset, we asked 14 professors to indicate the concepts a student needs in order to solve each question, based on the exercise solution. The teaching experience of the professors ranges from 2 to 20 years, and they were not necessarily involved in the classes participating in this work. Each professor was asked to associate up to three concepts (from 15 available) with 15 randomly assigned code snippets. On average, each code snippet received four evaluations. Of the 54 code snippets, 37 had one or more concepts on which all the professors agreed. If we lower the threshold to count snippets where at least 75% of the professors agree (3 out of 4), this number increases to 53. So, we decided to use this 75% agreement threshold to relate concepts to exercises. The concepts are not mutually exclusive, and a problem can be assigned more than one concept. Around 45% of the problems involved the loop concept and 40% the conditional concept. About 22% involved working with strings and 12% involved math questions.
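The sketch below illustrates how the 75% agreement rule can be applied to a single code snippet; the concept labels and votes are invented for illustration and do not come from the collected evaluations.

    # Illustration of the 75% agreement threshold; labels and votes are made up.
    from collections import Counter

    def agreed_concepts(votes, threshold=0.75):
        """Return the concepts tagged by at least `threshold` of the evaluators.

        `votes` holds one set of concepts per professor who rated the snippet
        (each professor could pick up to three concepts).
        """
        counts = Counter(concept for vote in votes for concept in vote)
        needed = threshold * len(votes)
        return {concept for concept, n in counts.items() if n >= needed}

    snippet_votes = [{"loop", "conditional"}, {"loop"},
                     {"loop", "string"}, {"loop", "conditional"}]
    print(agreed_concepts(snippet_votes))  # {'loop'}: 4/4 agree; 'conditional' only 2/4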
3.2 Semester Dataset
This dataset is 7.5 times larger in number of attempts than the previous one, containing 27,491 attempt records. However, since it accompanied the same students for an entire semester, the number of students is actually smaller: 181 different students. Simple statistics for the dataset are shown in Table 2. This dataset has a slightly higher success rate than the previous one, although it is still very imbalanced: it contains 6,849 (24.91%) successful attempts against 20,642 (75.09%) unsuccessful ones. This can be explained by the fact that each weekly set of exercises covers mostly what was seen in class that same week, so students did not have much time to forget the subject [3, 10, 13, 14, 16], in contrast to the revision exercises, which were done before finals or at the beginning of the next semester.

Table 2: Semester dataset statistics
                                    Avg    Median   Min    Max
attempts per question             422.94    349      85   1291
attempts per student              151.88    114       2   1002
different students per question    87.22     87      39    145
different questions per student    31.32     31       1     65

Another interesting difference between the datasets is the average number of attempts per student. Whereas in the revision dataset it is lower than the total number of questions, indicating that not every question was answered, in the semester dataset it is about twice the total number of available questions, indicating that students redid exercises even after they had already succeeded at them. This can be used to measure whether students are using the system and the exercises to study and review the content, for example.

Fig. 3 shows the histogram of students per number of attempted questions. We can notice that only 15 of the 181 students (8%) attempted more than 59 of the 65 different exercises (92%). The exercise distribution is actually quite flat. If we divide it into three equal parts, approximately one third of the class (64 students, 35%) attempted up to one third of the exercises, another third (60 students, 33%) attempted between one third and two thirds of the exercises, and the last third (57 students, 31%) attempted between 45 and 65 exercises. When provided in real time, this information can be used by professors to find students who are having difficulty finishing the exercises and to provide personalized assistance.

Figure 3: Distribution of students per number of questions.
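As an illustration, a per-student coverage summary like the one behind Fig. 3 could be computed from the attempt records roughly as follows; the column names and the toy data are assumptions, not the released dataset's actual schema.

    # Sketch of the per-student exercise coverage used to describe Fig. 3.
    # Column names and the toy data are illustrative assumptions.
    import pandas as pd

    attempts = pd.DataFrame({
        "student_id": ["s1", "s1", "s2", "s2", "s2", "s3"],
        "problem_id": ["p1", "p2", "p1", "p3", "p3", "p1"],
    })

    # Distinct exercises each student attempted (out of the 65 in the course).
    coverage = attempts.groupby("student_id")["problem_id"].nunique()

    # Bucket students into thirds of the 65 available exercises.
    buckets = pd.cut(coverage, bins=[0, 21, 43, 65],
                     labels=["up to 1/3", "1/3 to 2/3", "more than 2/3"])
    print(buckets.value_counts())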
4. CONCLUSIONS
In this paper, we presented the Machine Teaching ITS, developed to assist students and professors in modularized, function-based Python classes. We also presented and analyzed the two approaches under which data were collected. For the first approach, we invited professors to associate the needed concepts with each question and calculated their agreement rate. In general, they agree on at least two concepts for each exercise. For the second approach, we visualized the distribution of the dataset, which can be used by professors during the semester to identify students with difficulties.

4.1 Future work and call for collaboration
We are currently working on integrating a recommendation engine into the system. After this step, we will perform A/B tests with professors and students and collect their opinions on the available tools. In addition, there is a lot to be explored in these datasets. For example, we would like to study the differences in behavior between revision and semester students, investigate student procrastination, infer student knowledge based on their answers, and research temporal learning effects, among other ideas. We invite researchers interested in exploring the datasets to contact the authors.

5. ACKNOWLEDGMENTS
We would like to thank the professors who contributed to this research, either by adopting the tool in class or by evaluating the results. This work was supported by CNPq (141089/2016-4) and in part by FAPERJ (E26/202.838/2017-CNE), CAPES (PROEX - 1201036), and CNPq (306258/2019-6).

6. REFERENCES
[1] L. Benotti, F. Aloi, F. Bulgarelli, and M. J. Gomez. The effect of a web-based coding tool with automatic feedback on students' performance and perceptions. In Proc. 49th ACM Tech. Symp. Comput. Sci. Educ., SIGCSE '18, pages 2-7, Baltimore, Maryland, USA, Feb. 2018. http://doi.acm.org/10.1145/3159450.3159579.
[2] P. Brusilovsky, S. Edwards, A. Kumar, L. Malmi, L. Benotti, D. Buck, P. Ihantola, R. Prince, T. Sirkiä, S. Sosnovsky, J. Urquiza, A. Vihavainen, and M. Wollowski. Increasing adoption of smart learning content for computer science education. In Proc. Work. Group Rep. 2014 Innov. Technol. Comput. Sci. Educ. Conf., ITiCSE-WGR '14, pages 31-57, Uppsala, Sweden, June 2014. https://doi.org/10.1145/2713609.2713611.
[3] B. Choffin, F. Popineau, Y. Bourda, and J.-J. Vie. DAS3H: Modeling student learning and forgetting for optimally scheduling distributed practice of skills. In Proc. 12th Int. Conf. Educational Data Mining, pages 29-38, Montreal, Canada, July 2019.
[4] C. Delgado, J. Da Silva, F. Mascarenhas, and A. Duboc. The teaching of functions as the first step to learn imperative programming. In Anais do Workshop sobre Educação em Computação (WEI), pages 388-397. Sociedade Brasileira de Computação - SBC, Jan. 2016. https://doi.org/10.5753/wei.2016.9683.
[5] J. Hu. Python programming exercises, 2018. https://github.com/zhiwehu/Python-programming-exercises.
[6] P. Ihantola, A. Vihavainen, A. Ahadi, M. Butler, J. Börstler, S. H. Edwards, E. Isohanni, A. Korhonen, A. Petersen, K. Rivers, M. A. Rubio, J. Sheard, B. Skupas, J. Spacco, C. Szabo, and D. Toll. Educational data mining and learning analytics in programming: Literature review and case studies. In Proc. 2015 ITiCSE Work. Group Rep., ITiCSE-WGR '15, pages 41-63, Vilnius, Lithuania, July 2015. http://doi.acm.org/10.1145/2858796.2858798.
[7] I. Jivet, M. Scheffel, M. Specht, and H. Drachsler. License to evaluate: Preparing learning analytics dashboards for educational practice. In Proc. 8th Int. Conf. Learn. Analytics Knowl., LAK '18, pages 31-40, Sydney, New South Wales, Australia, Mar. 2018. https://doi.org/10.1145/3170358.3170421.
[8] E. Johns, O. Mac Aodha, and G. J. Brostow. Becoming the expert: Interactive multi-class machine teaching. In Proc. IEEE Conf. Comput. Vision Pattern Recognit., pages 2616-2624, 2015.
[9] A. N. Kumar. The effect of using problem-solving software tutors on the self-confidence of female students. In Proc. 39th SIGCSE Tech. Symp. Comput. Sci. Educ., SIGCSE '08, pages 523-527, Portland, OR, USA, Mar. 2008. https://doi.org/10.1145/1352135.1352309.
[10] A. Lalwani and S. Agrawal. What does time tell? Tracing the forgetting curve using deep knowledge tracing. In S. Isotani, E. Millán, A. Ogan, P. Hastings, B. McLaren, and R. Luckin, editors, Artif. Intell. Educ., LNCS, pages 158-162, June 2019. https://doi.org/10.1007/978-3-030-23207-8_30.
[11] A. Latham, K. Crockett, D. McLean, and B. Edmonds. A conversational intelligent tutoring system to automatically predict learning styles. Comput. & Educ., 59(1):95-109, Aug. 2012. https://doi.org/10.1016/j.compedu.2011.11.001.
[12] R. Lobb and J. Harlow. Coderunner: A tool for assessing computer programming skills. ACM Inroads, 7(1):47-51, Feb. 2016. https://doi.org/10.1145/2810041.
[13] K. Nagatani, Y. Y. Chen, Q. Zhang, F. Chen, M. Sato, and T. Ohkuma. Augmenting knowledge tracing by considering forgetting behavior. In Proc. World Wide Web Conf., WWW '19, San Francisco, CA, USA, May 2019. http://doi.org/10.1145/3308558.3313565.
[14] P. Nedungadi and M. S. Remya. Incorporating forgetting in the Personalized, Clustered, Bayesian Knowledge Tracing (PC-BKT) model. In Proc. 2015 Int. Conf. Cogn. Comput. Inf. Process., Noida, India, May 2015. https://doi.org/10.1109/CCIP.2015.7100688.
[15] M. Pratusevich. Practice Python, 2017. www.practicepython.org.
[16] Y. Qiu, Y. Qi, H. Lu, Z. Pardos, and N. Heffernan. Does time matter? Modeling the effect of time with Bayesian knowledge tracing. In M. Pechenizkiy, T. Calders, C. Conati, S. Ventura, C. Romero, and J. Stamper, editors, Proc. 4th Int. Conf. Educational Data Mining, pages 139-148, Eindhoven, the Netherlands, July 2011.
[17] D. Robinson. The incredible growth of Python. https://stackoverflow.blog/2017/09/06/incredible-growth-python/, 2017. Online; accessed 12-June-2020.
[18] J. Self. Theoretical foundations for intelligent tutoring systems. J. Artif. Intell. Educ., 1(4):3-14, Sept. 1990.
[19] S. Sentance and A. McNicol. Python School, 2016. https://pythonschool.net/.
[20] Stack Overflow. Developer survey results 2018. https://insights.stackoverflow.com/survey/2018/, 2019. Online; accessed 12-June-2020.
[21] E. Thompson, A. Luxton-Reilly, J. L. Whalley, M. Hu, and P. Robbins. Bloom's taxonomy for CS assessment. Jan. 2008.
[22] K. Vanlehn. The behavior of tutoring systems. Int. J. Artif. Intell. Ed., 16(3):227-265, Aug. 2006.
[23] W3Resource. W3Resource, 2018. https://www.w3resource.com/python/python-tutorial.php.