ABSTRACT
The ability to understand a person’s knowledge is important in educational settings. This can be used to recognize gaps in knowledge and to diagnose misunderstandings and misconceptions. This paper presents an intelligent tutoring system created to gather student knowledge data and the proposed methodology to generate the datasets. We asked 14 professors to determine the concepts found in a set of problems and we compare the student behavior found in each methodology.

Keywords
open datasets, intelligent tutoring system, educational data mining, computer science education

∗Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1.     INTRODUCTION
In this work, we have two main motivations: to create an online learning environment to gather data about students’ knowledge and to stimulate students to learn the Python language.

It is important to understand why students succeed or fail when taking a course so we can improve teaching methods by identifying students’ needs and provide personalized education. Smart learning content is defined as visualizations, simulations and web-based environments that provide outputs for students based on the students’ input [2]. The adoption of smart learning content in classrooms and in self-learning environments motivates students [1, 7, 11, 12], improves student learning, and decreases student dropout or failure [1, 9, 8, 6] while increasing their self-confidence, especially in female students [9].

Also, Python is a general-purpose language, which means it can be used in a large variety of projects. This can be great to stimulate students, since they can work on projects they actually relate to. Python is also user-friendly and, for the past seven years, it has been the fastest-growing major programming language [17], being correlated with trending careers, such as DevOps and Data Scientist [20]. However, according to the 2015 review [6], only 11% of the Educational Data Mining and Learning Analytics papers about programming courses reported using Python as the course language.

These factors motivate our objective: the creation, deployment and use of online intelligent systems in Introduction to Programming classes using the Python language as a way to uncover students’ difficulties, to understand their knowledge and to provide timely feedback to keep students engaged. The main contributions of this paper are: 1) generating an anonymous open-source student interaction database using the Python language with corresponding solutions and 2) developing the Machine Teaching system¹, an Intelligent Tutoring System (ITS) with a built-in recommendation engine.

2.     DATA ACQUISITION METHODOLOGY
This section presents the web system developed to acquire student data and interact with students and educators. The system architecture is proposed as an improvement on a common architecture from the Intelligent Tutoring System literature. We also present the methodology used to collect student data.

2.1     System Architecture
To capture students’ data, a web system was designed and implemented. Intelligent Tutoring Systems (ITS) are systems designed to assist the tutoring of students on a personalized level. They have been around since the 1960s [22], with their first formal definition in the 1990s [18]. Nowadays, there are several different ITS covering a broad range of subjects, successfully used by hundreds of thousands of students a year. Ihantola et al. [6] present a table listing them and their supported programming languages.
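Answers in this system are Python functions that are checked against test cases, with pass or fail reported per case and the task counted as done only when all cases pass. A minimal sketch of that kind of check follows. The `check_submission` helper, the example exercise and its test cases are hypothetical and are not taken from the authors’ implementation, which defines a test case generator per exercise.

```python
from typing import Any, Callable, List, Tuple

# Hypothetical test cases for an exercise asking for a function that
# returns the largest element of a list: (arguments, expected result).
TEST_CASES: List[Tuple[tuple, Any]] = [
    (([1, 5, 3],), 5),
    (([-2, -7],), -2),
    (([4],), 4),
]


def check_submission(source: str, func_name: str,
                     cases: List[Tuple[tuple, Any]]) -> List[bool]:
    """Run a submitted function against each test case.

    Returns one pass/fail flag per case; any exception counts as a failure.
    NOTE: exec() on untrusted code is unsafe. This is only an illustration;
    a deployed system would need to sandbox student code.
    """
    namespace: dict = {}
    try:
        exec(source, namespace)               # define the student's function
        func: Callable = namespace[func_name]
    except Exception:
        return [False] * len(cases)           # code did not load: all cases fail

    results = []
    for args, expected in cases:
        try:
            results.append(func(*args) == expected)
        except Exception:
            results.append(False)
    return results


# Hypothetical student answer, submitted as free-text code.
submission = """
def largest(values):
    best = values[0]
    for v in values:
        if v > best:
            best = v
    return best
"""

results = check_submission(submission, "largest", TEST_CASES)
solved = all(results)   # the task is done only if every test case passes
print(results, solved)
```

Returning one flag per test case, rather than a single verdict, mirrors the per-case feedback the students receive on every submission.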

Ihantola et al. [6] also define a common architecture for these systems in their 2015 review. Common front-end features among them are an IDE for students to write, edit and execute code, a submission interface (which can be embodied in the IDE or in a separate part of the system), feedback for students’ actions and visualization schemes for teachers and researchers. The back-end usually supports saving data in some kind of storage, usually a relational database. However, part of this proposal is to build an integrated personalized exercise recommendation engine within the system. Therefore, none of the existing systems could be used without modifications. To avoid legacy code and to have better control of the desired features and captured student data, it was decided to build one from scratch using open-source Python and JavaScript libraries. In addition, the classes in which the system was employed use a modularization-based approach to teach imperative programming [4]. Each question requires self-contained modules of code as an answer, which translate to functions in Python. So, the proposed system also had to take this format into consideration to correct students’ assignments. Besides these adaptations, it shares most of the functionalities with the other systems. Fig. 1 illustrates an abstract view of the system, adapted from Ihantola et al. [6]. The main differences are the built-in recommendation engine, which controls what exercises are shown in the IDE for the student to solve, the way of correcting students’ assignments and the extra collected data, like time spent typing and the total time spent solving the question.

Figure 1: Abstract view of the system architecture.

The exercises currently available in the system belong to either the Applying or the Creating category in Bloom’s Taxonomy [21]. The students are presented with a problem and they should write the expected answer in a free-text coding format. For each exercise, a test case function generator was defined to correct the results. The students get feedback every time they submit an answer and they can see whether they passed or failed each unit test case. If they get all of them correct, the task is considered done and the student may move on to another problem. The system saves a state every time a student submits an answer.

2.2    Data Acquisition
For the data acquisition process, we used the system to collect the data in two approaches. The system was either used a single time for revision purposes (revision dataset), or throughout a whole semester (semester dataset).

The first approach uses 48 CS1 problems crawled from four Python web tutorials: Practice Python [15], Python School [19], Python Programming Exercises [5], and W3Resource [23]. The chosen sources provided the exercise statements along with the exercise code solution. Students’ responses to these exercises were collected in 10 different Introduction to Programming courses during 2 semesters. Students were assigned one of two strategies: either the system showed random problems or they followed a predefined path. The students were introduced to the system at the end of the semester before their final exams, or at the beginning of the next semester. In both cases, the exercises were supposed to act as revision exercises of all CS1 content.

In our second approach, we accompanied the students throughout a whole semester in four different Introduction to Programming courses (they had the same syllabus, but different professors). Every week, an exercise list concerning the subject given in class was available in the system. Students had one week as a deadline to finish them and their performance on these lists composed part of their final grade. In total, they had to solve 65 problems.

3.    RESULTS
This section presents statistics for both datasets, comparing the different behaviors found in them.

3.1    Revision Dataset
In total, there are 3,632 records from 192 students with an average of 18.4 attempts in 4.4 problems. Also, the dataset is imbalanced: there are 764 (21%) successful attempts and 2,868 (79%) failed attempts. It means that, on average, each student attempts a problem 4 times before getting all test cases correct in the fifth. Some simple statistics for the dataset are shown in Table 1. Fig. 2 is a histogram showing the distribution of successes and failures per problem. Similar behavior is found in the distribution of successful and failed attempts per student. Both success attempt distributions have smaller variance and smaller mean than the corresponding fail attempt distributions. This is expected since the students are given several tries to submit a correct response before moving on to the next problem.

Table 1: Revision dataset statistics
                                   Avg     Median   Min   Max
attempts per question              75.67   55       10    304
attempts per student               18.44   11       2     266
different students per question    18.17   14       4     71
different questions per student    4.43    3        1     44

For this dataset, we asked 14 professors to indicate the concepts needed in each question for a student to be able to solve it based on the exercise solution. The teaching experience of the professors ranges from 2 to 20 years and they were not necessarily involved in the classes participating in this work. Each professor was asked to associate up to three concepts (from 15 available) to 15 randomly assigned code snippets. On average, each code snippet received four evaluations. From the 54 code snippets, 37 of them had one or more concepts on which all the professors agreed. If we lower the threshold to count concepts on which at least 75% of the professors agree (3
out of 4), this number increases to 53. So, we decided to use this 75% threshold of agreement to relate the concepts and the exercises. The concepts are not mutually exclusive and a problem can be assigned to more than one concept. Around 45% of the problems involved the loop concept and 40% the conditional concept. About 22% involved working with strings and 12% involved math questions.

Figure 2: Distribution of questions’ success and fail

3.2   Semester Dataset
This dataset is 7.5 times bigger in number of attempts than the previous one, containing 27,491 attempt records. However, since it accompanied the same students for an entire semester, the number of students is actually smaller: 181 different students. Simple statistics for the dataset are shown in Table 2. This dataset has a slightly higher success rate than the previous one, although it is still very imbalanced. It contains 6,849 (24.91%) successful attempts against 20,642 (75.09%) unsuccessful ones. This can be explained by the fact that each weekly set of exercises covers mostly what was seen in class in the same week, so students did not have a lot of time to forget the subject [3, 10, 13, 14, 16], in contrast to the review exercises that were done before finals or at the beginning of the next semester.

Table 2: Semester dataset statistics
                                   Avg      Median   Min   Max
attempts per question              422.94   349      85    1291
attempts per student               151.88   114      2     1002
different students per question    87.22    87       39    145
different questions per student    31.32    31       1     65

Another interesting difference between the datasets is the average number of attempts per student. Whereas in the revision dataset it is lower than the total number of questions, indicating that not all the questions were answered, in the semester dataset it is more than two times the total number of available questions, indicating that students redid the exercises even though they had already succeeded at them. This can be used to measure if they are using the system and the exercises to study and review the content, for example.

Fig. 3 shows a histogram of the number of students per number of questions attempted. We can notice that only 15 out of the 181 (8%) students attempted more than 59 out of the 65 (92%) different exercises. The exercise distribution is actually quite flat. If we divide it in three equal parts, approximately one third of the class (64 students, 35%) attempted only one third of the exercises, one third (60 students, 33%) attempted two thirds of the exercises and the last third (57 students, 31%) attempted between 45 and 65 exercises. When provided in real time, this information can be used by the professors to find students that are having difficulties finishing exercises and provide personalized assistance.

Figure 3: Distribution of students per number of questions

4.    CONCLUSIONS
In this paper, we presented the Machine Teaching ITS, developed to assist students and professors in modularized function-based Python classes. We also presented and analyzed the two approaches in which data were collected. For the first approach, we invited professors to associate the needed concepts for each question and calculated their agreement rate. In general, they agree on at least two concepts for each exercise. For the second approach, we visualized the distribution of the dataset, which can be used by professors during the semester to identify students with difficulties.

4.1    Future work and call for collaboration
We are currently working on integrating a recommendation engine in the system. After this step, we will perform some A/B tests with professors and students and collect their opinions on the available tools. In addition, there is a lot to be explored in these datasets. For example, we would like to study the differences in behavior between revision and semester students, investigate student procrastination, infer student knowledge based on their answers, and research temporal learning effects, among other ideas. We invite researchers interested in exploring the dataset to contact the authors.

5.    ACKNOWLEDGMENTS
We would like to thank the professors that contributed to the research, either by adopting the tool in class or by evaluating the results. This work was supported by CNPq (141089/2016-4) and was supported in part by FAPERJ (E26/202.838/2017-CNE), CAPES (PROEX - 1201036), and CNPq (306258/2019-6).

[10] A. Lalwani and S. Agrawal. What Does Time Tell? Tracing the Forgetting Curve Using Deep Knowledge Tracing. In S. Isotani, E. Millán, A. Ogan, P. Hastings, B. McLaren, and R. Luckin, editors, Artif. Intell. Educ., LNCS, pages 158–162, June 2019.
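As a quick check of the figures reported in Sections 3.1 and 3.2, the success rates, the number of failed attempts per successful one and the size ratio between the two datasets can be recomputed from the attempt counts stated in the text. The sketch below uses only those counts; the `summarize` helper and the dictionary layout are illustrative names of our own, not part of the released datasets.

```python
# Attempt counts as reported in Section 3.1 (revision) and 3.2 (semester).
datasets = {
    "revision": {"success": 764, "fail": 2868},
    "semester": {"success": 6849, "fail": 20642},
}


def summarize(success: int, fail: int) -> dict:
    """Derive simple ratios from successful and failed attempt counts."""
    total = success + fail
    return {
        "total": total,
        "success_rate": success / total,
        # Average number of failed attempts for every successful one.
        "fails_per_success": fail / success,
    }


summary = {name: summarize(**counts) for name, counts in datasets.items()}

# How many times larger the semester dataset is, in number of attempts.
size_ratio = summary["semester"]["total"] / summary["revision"]["total"]

for name, stats in summary.items():
    print(f"{name}: {stats['total']} attempts, "
          f"{stats['success_rate']:.2%} successful, "
          f"{stats['fails_per_success']:.2f} failures per success")
print(f"semester/revision size ratio: {size_ratio:.2f}")
```

The failures-per-success ratio for the revision dataset is what underlies the statement that a student fails about four times before succeeding.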
