Intelligently Raising Academic Performance Alerts

Dimitris Kalles, Christos Pierrakeas and Michalis Xenos
Hellenic Open University, www.eap.gr (contact: dkalles@acm.org)

Abstract. We use decision trees and genetic algorithms to analyze the academic performance of students and the homogeneity of tutoring teams in the undergraduate program on Informatics at the Hellenic Open University (HOU). Based on the accuracy of the generated rules, we examine the applicability of the techniques at large and reflect on how one can deploy such techniques in academic performance alert systems.

1 INTRODUCTION

Student success is a natural performance indicator in universities. However, if that success is used as a criterion for tutor assessment (and subsequent possible contract renewal), and if students must evaluate their own teachers, then tutors may tend to relax their standards. This paper is about dealing with this issue in the context of the Hellenic Open University (HOU); we focus on the undergraduate Informatics program (about 2,500 students). We ask whether we can detect regularities in distance tutoring; we then try to associate them with measures of students' success in an objective way and, subsequently, reflect on how to effectively disseminate this information to all interested parties.

The measurement strategy we have developed to date at HOU has been progressively refined to deal with two closely linked problems: that of predicting student success in the final exams, and that of analyzing whether some specific tutoring practices have any effect on the performance of students. Each problem gives rise to a different type of user model. A student model allows us, in principle, to explain and maybe predict why some students fail in the exams while others succeed. A tutor model allows us to infer the extent to which a group of tutors diffuses its collective capacity effectively into the student population they advise. However, both types of models can be subsequently interpreted in terms of the effectiveness of the educational system that the university implements.

The rest of this paper is organised in five sections. The next section presents the educational background. Section 3 then reviews the fundamental features of the AI techniques that we have used. Following that, we report the experimental results for the undergraduate programme that we have analysed, as well as a short evaluation of the individual module results that seem to signify an interesting deviation. Section 5 presents a discussion from the point of view of how one can generalise our approach, as well as how one can substitute other intelligent techniques for data analysis; finally, we conclude and describe directions for future development.

2 THE EDUCATIONAL BACKGROUND

A module is the basic educational unit at HOU. It runs for about ten months and is the equivalent of about 3-4 conventional university semester courses. A student may register with up to three modules per year. For each module, a student is expected to attend five plenary class meetings throughout the academic year. A typical class contains about thirty students and is assigned to a tutor (tutors of classes of the same module collaborate on various course aspects). Class face-to-face meetings are about four hours long and are structured along tutor presentations, group work and review of homework. Furthermore, each student must turn in some written assignments (typically four or six), which contribute towards the final grade, before sitting a written exam. That exam is delivered in two stages: one need only sit the second stage after failing or missing the first.

Students fail a module, and may not sit the written exam, if they do not achieve a pass grade in the assignments they turn in; these students must repeat the module afresh. A student who only fails the written exam may sit it in the following academic year (without having to turn in assignments); such "virtual" students are also assigned to student groups, but the tutor is only responsible for marking their exam papers.

3 GENETIC ALGORITHMS AND DECISION TREES FOR PREDICTION

In our work we have relied on decision trees to produce performance models. Decision trees can be considered as rule representations that, besides being accurate, can produce comprehensible output, which can also be evaluated from a qualitative point of view [1, 2]. In a decision tree, nodes contain test attributes and leaves contain class descriptors.

A decision tree for the (student) exam success analysis problem could look like the one in Figure 1; it tells us that a mediocre grade at the second assignment (root) is an indicator of possible failure (left branch) at the exams, whereas a non-mediocre grade refers the alert to the fourth (last) assignment.

    Assgn2 in [3..6]?
      yes -> FAIL
      no  -> Assgn4 < 3?
               yes -> FAIL
               no  -> PASS

    Figure 1. A sample decision tree [3].

Decision trees are usually produced by analyzing the structure of examples (training instances), which are given in tabular form. An excerpt of a training set that could have produced such a tree is shown in Table 1. Note that the three examples shown are consistent with the decision tree. As this may not always be the case, there arises the need to measure accuracy, even on the training set, in order to compare the quality of two decision trees which offer competing explanations for the same data set.

    Table 1. A sample decision tree training set (adapted from [3]).

    Assgn1  Assgn2  Assgn3  Assgn4  Exam
    ...     ...     ...     ...     ...
    4.6     7.1     3.8     9.1     PASS
    9.1     5.1     4.6     3.8     FAIL
    7.6     7.1     5.8     6.1     PASS

Note that the sample decision tree utilizes data neither on the first nor on the third assignment, although such data is shown in the associated table. Such dimensionality-reduction information is typical of why decision trees are useful; if the trees we consistently derive for some problem seem not to use some data column, we feel quite safe not collecting measurements for that column. Of course, simple correlation could also deliver such information; however, it is the visual representation advantages of decision trees that have rendered them very popular data analysis tools.

GATREE [6] evolves populations of trees according to a fitness function that allows for fine-tuning decision tree size vs. accuracy on the training set. At each generation, a certain population of decision trees is generated and sorted according to fitness. Based on that ordering, certain genetic operators are performed on some members of the population to produce a new population. For example, a mutation may modify the test attribute at a node or the class label at a leaf, while a cross-over may exchange parts between decision trees.

The fitness function is fitness_i = Correct_i^2 * x / (size_i^2 + x), for tree i. The first part of the product is based on the actual number of training instances that tree i classifies correctly. The second part of the product (the size factor) includes a factor x which regulates the relative contribution of the tree size to the overall fitness; thus, the payoff is greater for smaller trees.

When using GATREE, we used the default settings for the genetic algorithm operations and set the cross-over probability at 0.99 and the mutation probability at 0.01. Moreover, all but the simplest experiments (explicitly identified as such in the following sections) were carried out using 10-fold cross-validation, on which all reported averages are based (i.e. one-tenth of the training set was reserved for testing purposes and the model was built by training on the remaining nine-tenths; ten such stages were carried out by rotating the testing one-tenth).

Of course, GATREE was first used [3] to confirm the qualitative validity of the original experimental findings [4], also serving as result replication, before advancing to more elaborate experiments [7, 8, 9].

4 DATA ANALYSIS AT A PROGRAMME LEVEL

Before advancing, we first review some aggregate statistics of the undergraduate Informatics programme at HOU. First, Table 2 presents the success rates for the modules that we have analysed.

    Table 2. Success (percentage) rates of modules.

            2004-5  2005-6  2006-7
    INF10     35%     38%     33%
    INF11     55%     52%     55%
    INF12     39%     34%     35%
    INF20     56%     44%     44%
    INF21     37%     44%     37%
    INF22     71%     61%     55%
    INF23     N/A     83%     97%
    INF24     70%     64%     58%
    INF30     81%     85%     84%
    INF31     93%     92%     85%
    INF35     N/A     98%     93%
    INF37     N/A     98%     98%
    INF42     N/A     N/A    100%

Analyzing the performance of high-risk students is a goal towards achieving tutoring excellence. It is, thus, reasonable to assert that predicting a student's performance can enable a tutor to take early remedial measures by providing more focused coaching, especially in issues such as priority setting and time management.

Initial experimentation at HOU [4] consisted of using several machine learning techniques to predict student performance with reference to the final examination. The scope of the experimentation was to investigate the effectiveness and efficiency of machine learning techniques in such a context. The WEKA toolkit [5] was used because it supports a diverse collection of techniques. The key result was that learning algorithms could enable tutors to predict student performance with satisfying accuracy long before the final examination. The key finding that led to that result was that success in the initial written assignments is a strong indicator of success in the examination. Furthermore, our tutoring experience corroborates that finding.

We then employed the GATREE system [6] as the tool of choice for our experiments, to progressively set and test hypotheses of increasing complexity based on the data sets that were available from the university registry.
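To make the mechanics concrete, the fitness computation that drives this evolutionary search can be sketched in a few lines of Python. This is a minimal illustration of the formula of Section 3 applied to the sample tree of Figure 1 and the rows of Table 1; the function names, the tree-size convention (counting both test nodes and leaves) and the default value of x are our own assumptions for this sketch, not the actual GATREE implementation.

```python
# Sketch of a GATREE-style fitness: fitness_i = Correct_i^2 * x / (size_i^2 + x).
# Accuracy is rewarded quadratically; the size factor favours smaller trees.

def classify(assgn2, assgn4):
    """The sample tree of Figure 1: a mediocre Assgn2 grade predicts failure,
    otherwise the alert is referred to the fourth assignment."""
    if 3 <= assgn2 <= 6:
        return "FAIL"
    return "FAIL" if assgn4 < 3 else "PASS"

def fitness(correct, size, x=1000):
    # x regulates how strongly tree size counts against the payoff.
    return correct ** 2 * x / (size ** 2 + x)

# The three training rows of Table 1, reduced to (Assgn2, Assgn4, Exam).
training = [(7.1, 9.1, "PASS"), (5.1, 3.8, "FAIL"), (7.1, 6.1, "PASS")]
correct = sum(classify(a2, a4) == exam for a2, a4, exam in training)

print(correct)                   # the tree classifies all three rows correctly
print(fitness(correct, size=5))  # 5 nodes: 2 tests plus 3 leaves
print(fitness(correct, size=9))  # a bulkier tree of equal accuracy scores lower
```

A genetic algorithm would evaluate such a fitness for every tree in the population, sort, and then apply mutation and cross-over to the fitter members, as described in Section 3.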
The formation and development of these tests is the core content of this paper and is presented and discussed in detail in the following sections. GATREE is a decision tree builder that employs genetic algorithms to evolve populations of decision trees; it was eventually used because it produces short, comprehensible trees.

Next, Table 3 presents the enrolment numbers for these modules. Note that, as we advance from junior to senior years, the overall enrolment is dramatically reduced while the success rates increase.

    Table 3. Enrolment numbers at modules.

            2004-5  2005-6  2006-7
    INF10     987   1,247   1,353
    INF11     492     517     642
    INF12     717     818     925
    INF20     362     389     420
    INF21     322     363     383
    INF22     321     291     321
    INF23     N/A      52      73
    INF24     157     167     221
    INF30     156     198     199
    INF31     149     200     144
    INF35     N/A     101      58
    INF37     N/A     106     132
    INF42     N/A     N/A     109

The above statistics are all drawn from the university registry and none is subject to any further processing. However, all results presented from now on refer to experiments carried out entirely with the GATREE system, with the occasional help of some post-processing automation scripts.

4.1 Detecting a shift in exam grades

There is a straightforward way to attempt to detect such a shift. One can build a model that attempts to answer the success question for the first stage of the final exam. Then, one can build a model that attempts to answer the success question for the overall student grade. A gross disparity in these numbers should be indicative of an issue that merits investigation.

The simplest data to consider as input for this problem consists of exercise and exam grades, as in Table 1, omitting any other information (for example, which tutor was responsible for a student). The results reported are based on re-classification (we reserve a cross-validation-like mechanism for the more detailed experiments later on) and are shown in Table 4.

    Table 4. Model accuracies omitting tutor data.

             2004-5      2005-6      2006-7
              E    F      E    F      E    F
    INF10    83   84     84   82     83   82
    INF11    75   76     76   78     75   80
    INF12    74   76     86   74     78   74
    INF20    76   70     76   59     87   60*
    INF21    83   78     76   72     77   73
    INF22    68   80     68   76     63   70
    INF23   N/A  N/A     46   78*    89   99
    INF24    67   67     68   66     69   70
    INF30    77   82     64   85*    71   94*
    INF31    65   95*    86   93     68   91*
    INF35   N/A  N/A     72   97*    80   92
    INF37   N/A  N/A     95  100     95   98
    INF42   N/A  N/A    N/A  N/A     96  100

What does a difference signify? To answer that, one can take a step backwards and try to answer a simpler question: what does a large difference signify? We have elected to brand a difference as large when the re-classification accuracy of the same module for the same year differs by at least 20 percentage points when we compare the model predicting the pass/fail result of the first stage of the final exam (E) and the corresponding model after a possible second stage (F, which reflects the actual pass/fail grade for the module). In Table 4, such differences are marked with an asterisk.

There are two issues that become apparent when one views Table 4. The first is that whenever we observe an increase in the model accuracy when switching from the first exam (E) to the final grade (F), this is associated with senior modules where eventual success rates (see Table 2) are substantial. The only decrease is observed in a junior year module where success rates are considerably reduced compared to senior year modules.

It is straightforward to attribute the increase in senior year modules to the fact that, eventually, students have to focus on their exam and pass the test, regardless of how well they did along the year. The large discrepancy, however, suggests that the exercises do not serve their goal well, which is to keep the students engaged in the learning process. One could say that exercises are less of a learning opportunity and more of a necessary evil.

The dramatic decrease in the 2006-7 results of the INF20 module is quite interesting. It reflects, basically, a huge fail rate in the first stage of the exam, which is well served by a small model that predicts failure all around. When seen from that viewpoint, however, the relatively narrow margins of the junior year modules seem quite impressive, since they are also associated with low overall pass rates. The difference, however, is that the junior modules also report significant dropout rates, which skews pessimistically the rates reported in Table 2.

4.2 Detecting tutor influence

If we take the data sets that were used in Section 4.1 and put back in the information on which tutor was responsible for each student group, we can run the same experiments and try to see whether the tutor attribute will surface in some models (sample data are shown in Table 5).

    Table 5. An expanded sample training set (see Group).

    Assgn1  Assgn2  Assgn3  Assgn4  Group     Exam
    ...     ...     ...     ...     ...       ...
    4.6     7.1     3.8     9.1     Athens-1  PASS
    9.1     5.1     4.6     3.8     Patras-1  FAIL
    7.6     7.1     5.8     6.1     Athens-2  PASS

In principle, observing models where the tutor attribute appears near the decision tree root would not be a good thing, as it would suggest that a crucial factor in student success is not the educational system itself but the tutor. As a matter of fact, we can opt not to look for this information at all in the resulting trees; comparing the accuracies to the ones reported in Table 4 should suffice. These results are shown in Table 6.

    Table 6. Model accuracies including tutor data.

             2004-5      2005-6      2006-7
              E    F      E    F      E    F
    INF10    82   83     80   79     82   81
    INF11    75   77     76   78     75   80
    INF12    75   77     81   72     80   72
    INF20    76   72     76   62     87   61
    INF21    84   77     74   74     75   72
    INF22    66   80     68   74     62   75
    INF23   N/A  N/A     52   82     90   99
    INF24    63   69     69   69     66   74
    INF30    75   82     60   88     75   94
    INF31    67   94     85   93     89   91
    INF35   N/A  N/A     72   98     76   90
    INF37   N/A  N/A     96  100     94   98
    INF42   N/A  N/A    N/A  N/A     96  100

This time we observe that the relative differences between the models which utilise the tutor attribute and the ones that do not are quite small. There are some very interesting cases, however. For example, the INF11 module demonstrates near-zero differences throughout. It is interesting to note that this module utilizes a plenary exam marking session, which means that tutors get to mark exam papers drawn from all groups at random. This incurs only a marginal administrative overhead and, when viewed from the point of model consistency, seems to be well worth it.

Another example is the INF31 module, which demonstrated a year (2006-7) where the tutor attribute seemed to be of paramount importance. In that year, the gap between the first exam stage and the final grade seems to be influenced by the tutors: it is now very narrow (89 to 91) while it was quite wide (68 to 91). This could suggest a relative gap in tutor homogeneity.

There is one other way to view the importance of the tutor attribute. One can derive a model for one module group and then attempt to use that model as a predictor of performance for the other groups of the same module. This approach, while suppressing the tutor attribute, essentially tests its importance by specifically segmenting the module data set along groups. The overall accuracy is then averaged over all individual tests. This is the lesion comparison; its results are shown in Table 7.

    Table 7. Lesion study model accuracies including tutor data.

             2004-5      2005-6      2006-7
              E    F      E    F      E    F
    INF10    78   78     75   75     77   77
    INF11    70   74     72   75     71   74
    INF12    71   68     77   69     75   71
    INF20    70   65     72   60     82   61
    INF21    79   68     69   65     69   64
    INF22    57   74     61   68     60   65
    INF23   N/A  N/A     65   74     83   98
    INF24    62   70     66   69     63   66
    INF30    70   82     64   84     65   89
    INF31    63   91     79   91     59   83
    INF35   N/A  N/A     66   97     72   91
    INF37   N/A  N/A     95   98     91   95
    INF42   N/A  N/A    N/A  N/A     93  100

The main difference from the results in Table 4 concerns INF23, where it now seems that the gap has been shortened somewhat. Surprisingly, this suggests an erratic intra-group consistency. Note also that the corresponding result in Table 4 was the only one not to pass the binary-choice (50%) level, which it only just did in Table 6.

Furthermore, we tried to summarise the results from a further point of view: that of consistency between the results reported in the E and F columns of both tables. Essentially, we computed the quantity (F6-E6)-(F7-E7) for each module and year, where the subscript indicates the table from which that particular number was drawn. Not surprisingly, the two singularities observed were module INF23 for year 2005-6 (with a value of about 20 points) and module INF31 for year 2006-7 (with a value of about -22 points).

4.3 Observing the accuracy-size trade-off

It is interesting to investigate whether the conventional wisdom on model characteristics is valid. In particular, we analysed the results in Table 6 and in Table 7 with respect to whether an increase (or decrease, accordingly) in model accuracy for a particular module in a given year was associated with a reduction in model size. We say that the model accuracy increases if the accuracy in the E column of that year is less than the corresponding number in the F column. For the 68 pairs of numbers reported in Table 6 and in Table 7, only in 4 of them did we see model accuracy and model size move in the same direction. So, conventional wisdom was confirmed in nearly 95% of the cases.

5 DISCUSSION

HOU has been the first university in Greece to operate, from its very first year, a comprehensive assessment scheme (covering tutoring and administrative services). Despite a rather hostile political environment (at least in Greece), quite a few academic departments have lately been moving in the direction of introducing such schemes, though the practice has yet to be adopted at a university level. Still, quite a mentality shift is required when considering the subtle differences between "measuring" and "assessing".

The act of measuring introduces some error in what is being measured. If indices are interpreted as assessment indices, then people (actually, any "assessed" subject where people are involved; groups of people, for example) will gradually skew their behaviour towards achieving good measurements. Such behaviour is quite predictably human, of course; the problem is that it simply educates people in the ropes of the measurement system while sidelining the real issue of improving the educational service.

By shifting measurement to quantities that are difficult to "tweak", one hopes that people whose performance is assessed will gradually shift from fine-tuning their short-term behaviour toward achieving longer-term goals. Indeed, if people find out that the marginal gains from fine-tuning their behaviour are too small for the effort expended to achieve them, it will be easier to convince them to improve more fundamental attitudes towards tutoring (as far as tutors are concerned) or studying (as far as students are concerned).

In our application, this is demonstrated in two ways. First, by disseminating tutor group homogeneity indices, one hopes that, regardless of what we call these indices, tutor groups will be motivated by peer pressure to consider their performance vis-a-vis other tutor groups. Even if that may not really be required, the introspection itself will quite likely improve how a particular tutor group co-operates; at the very least it will focus their decisions with respect to how such decisions might influence their overall ranking.

For students, a similar argument applies. Realising that one fits a model which predicts likely failure, even when that model is known to err quite some times, is something that will most likely motivate that person to take a more decisive approach to studying. For adult students, such a decisive approach might even mean dropping or deferring a course of study. This is not necessarily negative, however; knowing how to better utilise one's resources is a key skill in life-long learning.

We have selected decision trees because we want to generate models that can be effectively communicated to tutors and students alike. We have also selected genetic algorithms to induce the decision trees because we have shown [7] that, for the particular application domain, we can derive small, easy-to-communicate, yet accurate trees. We thus need a hybrid approach: rule-based output to be comprehensible and grounded, and evolutionary computing to derive this output.

Which other techniques should one utilise to develop the models? We cannot fail to note that conventional statistics can be cumbersome to disseminate to people with a background in the humanities or arts, and this could have an adverse impact on the user acceptance of such systems. In that sense, the decision of whether the models are computed centrally or in a decentralized fashion (by devolving responsibility to the tutors, for example) is a key factor. In any case, deploying our measurement scheme in an organization-wide context would also lend support to our initial preference for short models. At the same time, the possibility of a decentralized scheme also suggests that we should strive to use tools that do not demand a steep learning curve on the part of the tutors.

Of course, one can take an alternative course and drop the requirement that a model has to be communicated. If we only focus on the indices, then any technique can be used, from neural networks to more conventional ones, such as naive Bayes or logistic regression [4]. As in all data mining application contexts, it is the application that must drive the choice of techniques; for our problem, suffice it to note that the comparisons reported (Table 4, Table 6 and Table 7) are essentially technique-independent, yet the GATREE approach has proven to date to be the best method for prototyping the measurement exercise that we are developing.

6 CONCLUSION

We have shown how we have used a combination of genetic algorithms and decision trees to experiment with how one might set up a quality control system in an educational context.

Quality control should be a core aspect of any educational system, but setting up a system for quality control entails managerial and administrative decisions that may also have to deal with political side-effects. Deciding how best, and as early as possible, to defuse the potential stand-offs that a quality measurement message might trigger calls for the employment of techniques that not only ensure a basic technical soundness in the actual measurement but also cater to the way the results are conveyed and subsequently exploited. This is particularly so when the large-scale application context suggests that data and models will flow freely amongst thousands of tutors and tens of thousands of students.

We have earlier [9] expressed the view that our approach is applicable to any educational setting where performance measurement can be cast in terms of test (and exam) performance. In the present paper we have scaled up our analysis to cover several modules and years, and we still believe that taking the sting out of individual performance evaluation while still being able to convey the full message is a key component of tutoring self-improvement. Scaling our approach to other programmes and other institutions and, even, obtaining the approval of our own university for official and consistent reporting of such indices is, however, less of a technical nature and more of a political exercise. After all, we need to persuade people that some innovations are less of a threat and more of an opportunity.

ACKNOWLEDGEMENTS

Thanassis Hadzilacos (now at the Open University of Cyprus) contributed to this line of research while at the Hellenic Open University.

Anonymized data can be made available on request, for research purposes only, on a case-by-case basis.

We acknowledge the advice, from an anonymous reviewer of the CIMA-ECAI08 workshop, on how to improve the presentation of this work to reflect the combination of AI techniques used.

REFERENCES

[1] Mitchell, T. (1997). Machine Learning. McGraw Hill.
[2] Quinlan, J.R. (1993). C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann.
[3] Kalles, D., & Pierrakeas, Ch. (2006). Analyzing student performance in distance learning with genetic algorithms and decision trees. Applied Artificial Intelligence, 20(8), 655-674.
[4] Kotsiantis, S., Pierrakeas, C., & Pintelas, P. (2004). Predicting students' performance in distance learning using machine learning techniques. Applied Artificial Intelligence, 18(5), 411-426.
[5] Witten, I., & Frank, E. (2000). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. San Mateo, CA: Morgan Kaufmann.
[6] Papagelis, A., & Kalles, D. (2001). Breeding decision trees using evolutionary techniques. Proceedings of the International Conference on Machine Learning, Williamstown, Massachusetts, pp. 393-400. Morgan Kaufmann.
[7] Kalles, D., & Pierrakeas, Ch. (2006). Using genetic algorithms and decision trees for a posteriori analysis and evaluation of tutoring practices based on student failure models. Proceedings of the 3rd IFIP Conference on Artificial Intelligence Applications and Innovations, Athens, Greece, pp. 9-18. Springer.
[8] Hadzilacos, Th., Kalles, D., Pierrakeas, Ch., & Xenos, M. (2006). On small data sets revealing big differences. Proceedings of the 4th Panhellenic Conference on Artificial Intelligence, Heraklion, Greece, Springer LNCS 3955, pp. 512-515.
[9] Hadzilacos, Th., & Kalles, D. (2006). On the software engineering aspects of educational intelligence. Proceedings of the 10th International Conference on Knowledge-Based Intelligent Information & Engineering Systems, Bournemouth, UK, Springer LNCS 4252, pp. 1136-1143.
[10] Xenos, M., Pierrakeas, C., & Pintelas, P. (2002). A survey on student dropout rates and dropout causes concerning the students in the Course of Informatics of the Hellenic Open University. Computers & Education, 39, 361-377.