Mining an Online Judge System to Support Introductory Computer Programming Teaching

Rodrigo Elias Francisco
Instituto Federal Goiano – Campus Morrinhos
Morrinhos – GO – Brazil
+55 (64) 34137900
rodrigo.francisco@ifgoiano.edu.br

Ana Paula Ambrosio
Instituto de Informatica, Universidade Federal de Goiás
Goiânia – GO – Brazil
+55 (62) 35211181
apaula@inf.ufg.br

ABSTRACT
Computer programming is an activity that requires a set of cognitive processes which develop naturally through practice, that is, through writing algorithmic solutions. Students learn a great deal from their mistakes, but for that they need feedback on their work. Marking students' work is very time consuming, which often limits a teacher's capacity to offer close individual guidance. The PROBOCA project aims to build a tool, based on the BOCA online judge, suited to learning computer programming through practice. In addition to a problem database organized by theme and difficulty, the system provides functionalities that support the teacher in the classroom. One of the main endeavors is to develop a procedure for estimating the degree of difficulty of a given problem. This "nominal" parameter may then be compared to the difficulty level perceived by each student; the result is a valuable indicator of which students are experiencing difficulties. This paper presents the preliminary specification of PROBOCA's architecture and functional requirements, along with its current state of development.

Keywords
Online Judge; Computer Programming Education.

1. INTRODUCTION
The learning process of Computer Programming (CP) usually involves practicing the resolution of many problems. Students write code to implement algorithms that meet the exercise requirements, and teachers should review that code and give feedback to the students. As this is a time-consuming task, some teachers have been installing and using online judge systems as automatic reviewing and scoring tools.

Online judge refers to software tools originally designed for programming competitions. They usually run on a server that contestants access online. Basically, their role is to present the contestant teams with a list of problems, to which the teams respond by uploading program code that satisfies the criteria of each problem. The tool then evaluates the answer code by running it on a set of predefined inputs and comparing the program results to predefined outputs. If the output of the answer code exactly matches the expected output for the corresponding input, the answer code is considered correct; otherwise it is considered incorrect. No indication of where the program went wrong is given. Although helpful as teaching assistants, these tools were not designed for use in a classroom and therefore lack some features that are important for academic management.
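To make the judging rule just described concrete, the sketch below (in PHP, the language in which BOCA itself is written) shows the essence of exact output matching. It is an illustration only, not BOCA's actual grading code; the file names and verdict strings are assumptions.

<?php
// Minimal sketch of the exact-match judging rule described above.
// Illustration only: not BOCA's real grader; file names and verdict
// strings are assumptions.
function judgeSubmission(string $producedOutputFile, string $expectedOutputFile): string
{
    $produced = file_get_contents($producedOutputFile);
    $expected = file_get_contents($expectedOutputFile);

    // Accept only when the program output matches the registered expected
    // output exactly; any formatting difference (an extra space, a missing
    // newline) makes the answer "incorrect".
    return ($produced === $expected) ? 'correct' : 'incorrect';
}

// Example: echo judgeSubmission('student_output.txt', 'expected_output.txt');

Because the comparison is byte for byte, even a correct program whose output differs by a single space or newline is judged incorrect, a point that becomes relevant in Section 3.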
The BOCA software [1] is an online judge system used in programming marathons in Brazil. It is freely available for institutions to download and install. This particular system allows teachers to register problems and to track their students' work. However, the system is neither easy to handle nor does it include an exercise database (DB), which is needed to facilitate the generation of problem lists. This project proposes to extend the BOCA online judge to make it more suitable for use in introductory programming teaching. The resulting system, called PROBOCA, complements the online judge functionality with features that improve its ease of use, enhancing teacher productivity and course effectiveness.

One of the introduced features provides automatic classification of problem difficulty, based on the submitted solutions and on the students' evaluation of the degree of difficulty they encountered in solving a given problem. This helps teachers when composing exercise lists, allowing them to better gauge list complexity. It also lets students organize the order in which they solve the problems, tackling the easier ones before turning to the more complex ones.

Another feature is the production of student reports based on the submitted solutions and on the students' behavior while using the tool. Student evaluation of problem difficulty, number of submissions for the same exercise, time between submissions and order of submissions, among other information gathered by the tool, reveal how students behave and how easily they solve the proposed problems.

2. EXERCISES IN PROGRAMMING EDUCATION
Aiming at automatic correction of program code and at monitoring and evaluating students during the programming process, several initiatives have been developed.

Within this line of research, the study of Chaves et al. [2] explores resources of online judges with the aim of integrating them into the Moodle system. The goal is to provide a Virtual Learning Environment (VLE) with an automatic evaluation feature, allowing the teacher to monitor the students' problem solving. The authors defined an architecture containing a module for integration with online judges (MOJO), which connected the VLE structure to the URI Online Judge [3] and SPOJ Brazil [4].

In [5], the authors developed a prototype intended to be an educational online judge. They argue that online judge tools are suited to competition and have few educational features. They also criticize the way online judges provide feedback to students, indicating only whether the answer is right or wrong or whether there were errors at compilation or runtime. Their project, called JOnline, aims to add the following didactic features: presenting tips in Portuguese for fixing compilation errors found in the submitted source code; presenting the test cases that generate erroneous results; organizing the problems by topic; deducing a difficulty level from a poll conducted by the system; and providing a collaborative programming resource that allows two students to co-write one common program.

Automatic generation of exercise lists based on user-defined criteria requires the classification of problems according to those criteria. In [6], the authors conclude that there is a strong relationship between the difficulty of a problem and the total number of lines of code and the amount of flow control in the program (IF, WHILE, FOR, ...). New students have difficulty reading and interpreting problem statements, a problem that has been related to the difficulty of dealing with abstraction [9].
3. BOCA
Designed for use in programming marathons, the BOCA online judge system has its vocabulary and requirements contextualized by competitions and teams. The main interest in this system is that it was designed to enable its users, the competitors, to interact with a set of problems. Figure 1 shows a use case diagram with the different functions provided by BOCA. The software allows the registration of problems for a given competition. That is done by submitting a PDF file stating the problem to be tackled and the files containing the input and output data sets for the corresponding test cases. The registered problems are associated with a single competition; to reuse the problems in another competition, the files must be inserted again. BOCA does not include a database to store the problems.

Figure 1. BOCA's use case diagram

In the system, colored balloons represent the points gained by the competitors. This feature is part of the programming marathon context, where correctly answering a problem yields a balloon for the team. Besides allowing teams to monitor their submissions and the related results, BOCA also provides a screen displaying the team ranking, including each team's overall score and the list of solved exercises, using balloons to identify those successfully done. It also produces information on the students' interaction with the problems, including the time spent and the number of resolution attempts.

BOCA has been used in introductory CP classes at our institute for some years, with good results. Using this system enhances the practice of problem solving by providing automatic feedback to students. We have observed that the number of problems solved by students using BOCA is significantly higher than with traditional exercise lists. The burden on the teacher is significantly reduced and the students get feedback. Teachers are then able to focus on helping the students who fail to solve the problems even after several attempts. This strategy allows students to tackle a larger number of exercises, thus increasing their potential to master the cognitive skills required for programming.

However, in spite of these advantages, some drawbacks were revealed. Looking at the process of reviewing the students' answers, two issues were identified. First, the system assesses a submitted program by comparing its output, as generated by the student's code in response to a given input data set, to the registered expected output. For the two outputs to be identical, exact formatting of the program's output is required. At the beginning, students made many errors just for using a slightly different format. This situation marks the problem's result as wrong even though the program's logic may be right, causing frustration among students, as the system does not point out where the error is. Students, however, managed to adapt and began paying more attention to output formatting. Second, there is no cross-answer plagiarism identification, in other words, no control over cheating. It was observed that some students simply copied their colleagues' programs and submitted them as their own, and were thus considered to have solved the problem without any real comprehension of the solution.

Figure 2 shows the BOCA code submission screen. It was modified to include the "level of difficulty" selection, where students evaluate how hard it was to solve the problem they are submitting.

Figure 2. BOCA's submission functionality

Regarding class management functionality, it was noted that each installed instance of BOCA supports one active competition at a time, meaning a single class per instance. Thus, if a teacher wants to have several competitions running at the same time, i.e., different groups doing the same or different sets of exercises, he must install multiple instances of BOCA. Each instance is independent, so even if two competitions are composed of the same problems (e.g., for different classes), these problems must be registered separately in each instance. As each competition is independent, the system does not group the results of the different competitions, a feature that would be interesting in the case of multiple exercise lists delivered to the same class. It should also be noted that the system is not trivial to install and manage, which ends up discouraging teachers from adopting it. On top of that, the teacher needs to add each student to the system, which proves to be quite tedious, especially for large classes. To overcome this drawback, teachers have implemented programs that automatically generate registration ids and corresponding passwords based on the students' university registration numbers. As BOCA does not allow password modification, students often know other students' passwords, since they are generated following known patterns. This facilitates copying other students' solutions.

This shows that, although BOCA has several advantages, there are problems that hinder the use of the system in the classroom. To solve this, beyond what has already been presented, we aim to tackle the following requirements: automatic generation of problem lists based on subject and level of difficulty, measurement of the students' experience with problem solving by subject, and ranking of problems by level of difficulty. We consider the last item important for didactic purposes and, as we did not find an existing technique or algorithm to use, it is the focus of this work.
4. PROBOCA
From an architectural viewpoint, the proposed system builds upon BOCA, which is treated as an internal component whose structure, PHP source code and PostgreSQL DB are reused. To avoid complicating the development process further, the "competitions and teams" context is kept, with some adaptations: the terms "competition" and "team" are used loosely to indicate an exercise list and a student, respectively. BOCA's user interface allows for this extrapolation, which is already practiced by the teachers who use the system in their classrooms.

Adapting BOCA to support CP teaching brought the need to introduce new functional requirements, as presented in the use case diagram in Figure 3.

Figure 3. Use case diagram with the new requirements

PROBOCA required some changes to the BOCA DB structure. It was necessary to modify the internal tables in order to link the stored problems to a given course syllabus and to difficulty levels, as well as to include more detailed data about the students. Furthermore, the original competition-based structure will be altered to accommodate the concept of multiple exercise lists.

Following requirement R1, student registration is now done by the student himself, using a token handed out by the teacher. Besides saving a password, other information is collected during registration that permits class-wide statistical analysis.

R2 is intended to simplify the task of competition generation, which is performed by the teacher. To achieve this goal, a database of problems was implemented. Currently, to generate a competition, as shown in Figure 4, the user indicates three parameters: a difficulty level (1 - easy; 2 - medium; 3 - difficult); the desired syllabus elements for the given problem set; and, finally, the number of problems to compose the list. The system then analyzes which problems best fit the user-defined parameters and generates a competition list from among the problems available in its DB. The difficulty level of each problem is estimated automatically by the system, based on data obtained from the solutions submitted by students and other parameters, as described in Section 6.

Figure 4. Generating problem lists (competitions)
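As an illustration of the R2 generation step, the sketch below selects a list of problems from the DB by difficulty, topic and quantity. The table and column names (problem, difficulty, topic) are hypothetical, and the real system analyzes which problems best fit the parameters; here a simple random choice among the matching problems stands in for that analysis.

<?php
// Sketch of R2: pick $quantity problems matching a difficulty level (1-3)
// and a set of syllabus topics. Table and column names are hypothetical;
// the actual PROBOCA schema, built on BOCA's PostgreSQL DB, may differ.
function generateProblemList(PDO $db, int $difficulty, array $topics, int $quantity): array
{
    $placeholders = implode(',', array_fill(0, count($topics), '?'));

    $sql = "SELECT id, name
              FROM problem
             WHERE difficulty = ?
               AND topic IN ($placeholders)
             ORDER BY random()
             LIMIT " . (int) $quantity;     // PostgreSQL syntax

    $stmt = $db->prepare($sql);
    $stmt->execute(array_merge([$difficulty], $topics));

    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}

// Example: six medium problems covering selection and repetition.
// $list = generateProblemList($pdo, 2, ['selection', 'repetition'], 6);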
R3's purpose is to provide teachers with a report of each student's experience level based on his interaction with the system. Using the collected data, it should be possible to present information about a student's experience regarding the syllabus elements addressed in each problem, thus allowing the teacher to detect where students show more difficulty. This requirement depends on R4, which measures the difficulty level of a given problem; a simple mechanism to measure the degree of difficulty of a given problem was developed and tested. For R3 to be reliable, a plagiarism identification strategy (R9) must also be implemented in the system.

R5 aims at associating a student with multiple competitions without the need to register each time. Thus, the data from different competitions can be associated with the student, allowing the integration of the results obtained for the different lists presented to a given class. This requirement depends on R6, which adapts BOCA to judge problems from several competitions. R7 is also related to R5, with the goal of allowing a student to participate in several competitions, analogous to answering several problem lists.

R8 aims to display the problems' balloons color coded to indicate the problem's difficulty level. Currently, balloon colors are registered by the system administrator along with the problems and have no relation to difficulty or content, unless the classification and corresponding color attribution are done by the teacher when uploading the problems.

5. IMPLEMENTATION
The functionalities of detailing student information and of automatically generating exercise lists are already implemented and tested. Requirement R4, estimating the difficulty level of a given problem, is currently in progress.

Within this requirement, the first task was to create 74 exercises using BOCA's problem registration functionality. These exercises cover the syllabus of the "Introduction to Programming" course (CS1) and range from beginner programming instructions ("Hello World" style) up to coding with matrices. The inserted problems populated an initial database to be used for testing; additional problems are expected to be easily included. An estimated level of difficulty for each problem was supplied by the teacher who inserted the problem into the database.

To submit an answer, the student uploads one code file per problem solved. Along with the file upload, the student is asked to evaluate the problem's difficulty by indicating one of three choices, Easy, Medium or Difficult, based on his experience when solving the problem (Figure 2).

Success was obtained in computing several parameters that are needed to calculate problem difficulty. They include different measures, such as the number of repetition and selection structures used in the solution and the number of topics involved in the problem. Problem topics were defined using the introductory programming course syllabus: input/output; assignment; selection; repetition; vectors; strings; matrices; and functions. In this list, topics appear in the order they are taught and therefore in increasing level of complexity for novice students; since programming is incremental, more complex topics usually build upon less complex ones. Several tests have been conducted to define and validate the mathematical function that will calculate the difficulty of a problem, but this is still an open issue.

Based on the calculated difficulty and the topics involved in the problem, colored balloons are associated with the problem. Different colors represent different topics and, within each topic, the level of difficulty is represented by the intensity of the color. For example, blue represents problems whose most complex topic is selection: light blue balloons are associated with easy problems involving selection, medium blue balloons with medium difficulty problems, and dark blue balloons with problems that demand more from the students.
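A minimal sketch of this color coding is shown below; the concrete topic-to-hue mapping is an assumption made for illustration, not the palette actually used in PROBOCA.

<?php
// Illustrative sketch of the balloon color coding: the hue identifies the
// most complex topic of the problem and the intensity encodes difficulty.
// The hues below are assumptions, not PROBOCA's real palette.
function balloonColor(string $mostComplexTopic, int $difficulty): string
{
    $hues = [
        'selection'  => 'blue',
        'repetition' => 'green',
        'vectors'    => 'red',
    ];
    $shades = [1 => 'light', 2 => 'medium', 3 => 'dark'];

    $hue   = $hues[$mostComplexTopic] ?? 'gray';
    $shade = $shades[$difficulty] ?? 'medium';

    return "$shade $hue";   // e.g. "light blue" for an easy selection problem
}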
The task of providing teachers with a report on student achievement (R3) has not been tackled yet. Ideas in this direction include: (1) comparing a student's perceived problem difficulty with the mean problem difficulty: if a student perceives problems as harder or easier than the mean, this could indicate a lower or higher level of programming competency; (2) comparing successive submissions of the same problem, which may show whether students adopt a trial-and-error approach; (3) the mean time taken to solve problems; (4) the mean number of submissions per problem; (5) the score when compared to the other students; among others.
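As a sketch of how indicators (2) to (4) could be derived, the fragment below aggregates one student's submission records. The record fields (problem, time) are assumptions made for illustration, since this report has not yet been implemented; in PROBOCA the data would come from the submission tables.

<?php
// Sketch of indicators (2)-(4): aggregate one student's submission records.
// The record fields ('problem', 'time') are assumptions for illustration.
function studentIndicators(array $submissions): array
{
    $byProblem = [];
    foreach ($submissions as $s) {
        $byProblem[$s['problem']][] = $s['time'];   // unix timestamps
    }

    $counts = [];   // submissions per problem
    $gaps   = [];   // seconds between successive submissions of the same problem
    foreach ($byProblem as $times) {
        sort($times);
        $counts[] = count($times);
        for ($i = 1; $i < count($times); $i++) {
            $gaps[] = $times[$i] - $times[$i - 1];
        }
    }

    return [
        'mean_submissions_per_problem' => $counts ? array_sum($counts) / count($counts) : 0,
        'mean_seconds_between_tries'   => $gaps ? array_sum($gaps) / count($gaps) : null,
    ];
}

Under these assumptions, a high mean number of submissions combined with short gaps between them would point to the trial-and-error behavior mentioned in item (2).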
6. ESTIMATING PROBLEM DIFFICULTY
In a sense, "difficulty" expresses the lack of ability, or the amount of skill or effort needed to accomplish something. Obviously, the perception of difficulty is an individual matter, related to many aspects, including the moment at which the question is posed. Nonetheless, this work investigates the possibility of characterizing a programming problem in such a way that calculated parameters correlate with the difficulty level expressed by students.

In their work [6], Alvarez and Scott present program control flow and number of lines of code as variables that correlate with the difficulty of a problem. The experiments undertaken in the current study corroborate this information.

Also, other works [7, 8] show the need to deal with abstraction in order to solve problems. A highly abstract problem is one that implies a greater level of generalization. Thus, a problem that involves many different topics can become more abstract and hence more difficult. It is fairly safe to assume that, in order to measure the difficulty of a problem, it is necessary to make a detailed analysis of the topics addressed within it. That proposition helps in formulating the approach described below.

6.1 Problem Difficulty Mechanism
On one hand, the survey of the students' opinions about the difficulty of the submitted problems provided a first estimate. For each of the 74 registered exercises, this information was stored in the DB along with the respective answers. Although information was collected for all students, the statistics were calculated considering only those who completed at least 95% of the exercises (70 out of 74). This excluded students who dropped out of the course, in order to prevent their partial answers from skewing the results. After filtering, the mean difficulty "Mean_Dif" was calculated for each exercise based on the answers of the remaining 62 students. "Median_Dif" and "Mode_Dif" were also calculated.

In addition, C source code was written to correctly solve each of the registered problems. A PHP program was also developed to analyze the programs' internal structures and to extract some measures from each program. The measures include counting:

• Lines of code, represented by the variable N_LC;
• Repetition structures used in the solution, N_RP;
• Selection structures, N_SL;
• Edges in a graph that represents the algorithm, N_EDG;
• Edges in that graph that represent repetition structures in the algorithm, N_EDG_RP;
• Height of a tree in which each internally nested sub-block of code is a node, MX_HT; and
• Number of topics involved in the problem, N_TPC.

This last count was obtained manually.
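The fragment below sketches, in the spirit of that PHP analyzer, how the simpler keyword-based counts can be extracted from a C solution file. It is a simplification: the structural measures (N_EDG, N_EDG_RP and MX_HT), which require building the algorithm's graph and nesting tree, are omitted.

<?php
// Simplified sketch of the measure extraction for a C solution file.
// Only keyword-based counts are shown; N_EDG, N_EDG_RP and MX_HT need
// the control-flow graph and nesting tree and are omitted here.
function extractMeasures(string $cSourceFile): array
{
    $code = file_get_contents($cSourceFile);

    // Non-empty lines of code.
    $lines = array_filter(array_map('trim', explode("\n", $code)),
                          fn ($l) => $l !== '');

    preg_match_all('/\b(for|while|do)\b/', $code, $repetitions);
    preg_match_all('/\b(if|switch)\b/',    $code, $selections);

    return [
        'N_LC' => count($lines),              // lines of code
        'N_RP' => count($repetitions[0]),     // repetition structures
        'N_SL' => count($selections[0]),      // selection structures
    ];
}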
Correlating these measures with the student evaluations (Median_Dif, Mean_Dif and Mode_Dif) gives the results in Table 1.

Table 1. Correlation between measured variables and student evaluation
(Pearson correlation; Sig. 2-tailed in parentheses; N = 74)

                        Median_Dif    Mean_Dif    Mode_Dif
N_RP (Repetitions)      .49 (.000)    .52 (.000)  .44 (.000)
N_SL (Selections)       .16 (.171)    .34 (.003)  .19 (.108)
N_LC (Lines of Code)    .44 (.000)    .62 (.000)  .45 (.000)
N_TPC (Topics)          .57 (.000)    .69 (.000)  .59 (.000)
N_EDG (Edges)           .42 (.000)    .58 (.000)  .41 (.000)
MX_HT (Tree Height)     .52 (.000)    .67 (.000)  .56 (.000)
N_EDG_RP (Rep. Edges)   .53 (.000)    .60 (.000)  .51 (.000)

To verify whether a combination of variables obtained better results, the following formulae were also tested against the Mean_Dif variable to verify their correlation. Table 2 shows the results. Mean_Dif correlated positively with all measured variables, and the best correlation was associated with f4 (r = .76, p = .000), which improves on the individual correlations with N_TPC (number of topics) and MX_HT (maximum tree height).

f1 = (N_RP + N_SL + N_LC + N_TPC + N_EDG + MX_HT + N_EDG_RP) / 7
f2 = 0.3 * N_EDG + 1.4 * MX_HT + 1.4 * N_EDG_RP + 0.2 * N_LC
f3 = (N_TPC + MX_HT + N_LC) / 3 + (N_EDG_RP + N_EDG + N_RP + N_SL) / 8
f4 = (N_TPC + MX_HT) / 2
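The sketch below shows how one of these formulae (f4) and the Pearson correlation used throughout this section can be computed from the per-problem measures. The input arrays are assumed to be already loaded from the DB; this is not the exact analysis script used in the study.

<?php
// Sketch: compute f4 for every problem and correlate it with the students'
// mean difficulty rating (Mean_Dif). Input arrays are assumed inputs.
function f4(array $m): float
{
    return ($m['N_TPC'] + $m['MX_HT']) / 2;
}

function pearson(array $x, array $y): float
{
    $n  = count($x);
    $mx = array_sum($x) / $n;
    $my = array_sum($y) / $n;

    $cov = $sx = $sy = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $cov += ($x[$i] - $mx) * ($y[$i] - $my);
        $sx  += ($x[$i] - $mx) ** 2;
        $sy  += ($y[$i] - $my) ** 2;
    }
    return $cov / sqrt($sx * $sy);
}

// $measures: one entry per problem (N_TPC, MX_HT, ...); $meanDif: one rating per problem.
// $r = pearson(array_map('f4', $measures), $meanDif);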
Table 2. Correlation of student perceived difficulty and the developed formulae
(Pearson correlation; Sig. 2-tailed in parentheses; N = 74)

     Mean_Dif
f1   .59 (.000)
f2   .55 (.000)
f3   .67 (.000)
f4   .76 (.000)

We also verified the correlations with the teachers' evaluation of the problems' difficulty. For this, we asked five teachers to read each problem and attribute a level of difficulty in the range 1-5. Since we had only five evaluations, we chose to work with the median difficulty obtained (Median_Prof_Dif). Correlating this variable with the measured variables, we obtained the results presented in Table 3. The best correlations were obtained with N_RP and N_EDG_RP, r = .62, p = .000, both related to the number of repetition structures found in the solution codes.

Table 3. Correlation between measured variables and teacher evaluation
(Pearson correlation; Sig. 2-tailed in parentheses; N = 74)

                        Median_Prof_Dif
N_RP (Repetitions)      .62 (.000)
N_SL (Selections)       .20 (.093)
N_LC (Lines of Code)    .56 (.000)
N_TPC (Topics)          .50 (.000)
N_EDG (Edges)           .53 (.000)
MX_HT (Tree Height)     .50 (.000)
N_EDG_RP (Rep. Edges)   .62 (.000)

Table 4 correlates the teacher-defined difficulty with the formulae defined above. A positive correlation was found with all formulae, the best correlation being associated with f2, r = .63, p = .000. Furthermore, a positive correlation was found between the teacher-defined difficulty and the mean student perceived difficulty, r = .69, p = .000.

Table 4. Correlation of teacher defined difficulty and the developed formulae
(Pearson correlation; Sig. 2-tailed in parentheses; N = 74)

     Median_Prof_Dif
f1   .59 (.000)
f2   .63 (.000)
f3   .66 (.000)
f4   .68 (.000)

Although student perceived difficulty and teacher defined difficulty are correlated, there are differences, which can be verified in the correlations found with the measured variables and the proposed formulae. This could be explained by the fact that students evaluate difficulty based on the knowledge they have when developing the solution. As they do not know what comes ahead, they cannot base their evaluation on an overall knowledge of programming. Teachers, on the other hand, have this overall view of the domain and evaluate difficulty accordingly. It must also be observed that the teachers who did the evaluation are new to teaching and have not yet acquired a more critical understanding of the difficulties students encounter when learning to program: after developing algorithmic reasoning, people tend to forget how they thought before and find many concepts obvious when, for beginners, they are not.

7. CONCLUSION
PROBOCA is a system under development whose goal is to adapt the BOCA software for use in teaching programming. While BOCA itself was developed for programming marathons, it is already in use as a support tool in introductory programming courses. As a learning aid, BOCA has several limitations, especially regarding class administration and student monitoring. In addition, as a system specifically developed for competitions, it lacks mechanisms that facilitate the creation of exercise lists, such as a problem database organized by topic and difficulty.

PROBOCA supports the persistence of the problems registered by teachers in the database and provides greater access to student-related information. Unlike other online judge systems that are available exclusively online and are managed by their creators, PROBOCA can be downloaded and installed by the teacher, giving the teacher control over the problems stored in the database. Since teachers are responsible for introducing the problems, this solution has the additional advantage of being language free, i.e., it is not limited to teachers and students who speak the language in which the problem specifications were written, as is the case for online systems administered by third parties.

In addition to implementing an environment that facilitates the use of BOCA in teaching programming, PROBOCA also aims to provide teachers and students with information that will help students in their learning process. One of the important features of PROBOCA is the automatic evaluation of problem difficulty. This gives students direction when choosing which exercises to solve first, allowing them to solve easier exercises before more complex ones and diminishing student frustration at not solving problems. It also allows teachers to better gauge the level of difficulty of exercise lists. As shown by the collected data, teacher and student evaluations of problem difficulty do not match, and this may lead to distortions when teaching programming. Future work could include developing a system that automatically suggests problems to students based on their performance and the calculated problem difficulty.

This work has shown the approach being taken to calculate the difficulty of the problems. This approach differs from others by treating the algorithms submitted by students as graphs and trees in order to identify properties that can be correlated with the difficulty of the problems. For this, data mining using correlations was undertaken. Part of the system has already been implemented and has been successfully used in the classroom. The success of these tasks shows the feasibility of the project and encourages further work.

REFERENCES
[1] BOCA Online Contest Administrator. Available online; accessed March 20th, 2015.
[2] Chaves, J. O. et al. (2013). Uma Ferramenta de Auxílio ao Processo de Ensino-Aprendizagem para Disciplinas de Programação de Computadores. TISE 2013.
[3] URI Online Judge. Available online; accessed May 20th, 2014.
[4] SPOJ Brasil. Available online; accessed May 20th, 2014.
[5] Santos, J. C. S. and Ribeiro, A. R. L. JOnline: proposta preliminar de um juiz online didático para o ensino de programação. SBIE 2011.
[6] Alvarez, A. and Scott, T. A. (2010). Using student surveys in determining the difficulty of programming assignments. Journal of Computing Sciences in Colleges, 26(2):157–163.
[7] Gomes, A. et al. (2008). Aprendizagem de programação de computadores: dificuldades e ferramentas de suporte. Revista Portuguesa de Pedagogia, 42(2).
[8] Mendonça, A. et al. (2009). Difficulties in solving ill-defined problems: A case study with introductory computer programming students. In Frontiers in Education Conference (FIE '09), 39th IEEE, pages 1–6. IEEE.
[9] Mendonça, A., de Oliveira, C., Guerrero, D., and Costa, E. (2009). Difficulties in solving ill-defined problems: A case study with introductory computer programming students. In Frontiers in Education Conference (FIE '09), 39th IEEE, pages 1–6. IEEE.