Prediction of Student Success Using Enrolment Data

Nihat Cengiz
Epoka University, Department of Computer Engineering
Rr. Tiranë-Rinas, Km. 12, 1039 Tirana, Albania
ncengiz@epoka.edu.al

Arban Uka
Epoka University, Department of Computer Engineering
Rr. Tiranë-Rinas, Km. 12, 1039 Tirana, Albania
auka@epoka.edu.al

ABSTRACT
Predicting the success of students as a function of different predictors has been investigated for many years. This paper explores socio-demographic variables that may influence the success of students: gender, region lived and studied in, nationality, and high school degree. We examine to what extent these factors help to predict students' academic achievement, and thereby to identify vulnerable students and their need for extra tutoring or similar supportive services at an early stage.

We analyzed the data of Epoka University students enrolled from 2007 to 2013. The sample includes 1211 undergraduate students, of whom 716 completed, or were due to have completed, the three-year bachelor studies in the past six semesters.

Based on the data mining techniques, the most important predictors of student success were the students' high school GPA and gender. Among students with below-average high school grades, female students had a higher percentage of success than male students. No significant correlation was found between the students' success and the demographic information.

Keywords
Academic achievement, influence, classification tree, outcome

1. INTRODUCTION
Increasing the graduation rate and decreasing the dropout rate are long-term goals of higher education institutions. From the students' perspective, a timely and successful graduation is vital, as these two factors strongly affect their employability. The employability rate has become an indicator in determining the ranking of a higher education institution (HEI), so HEIs are focusing more on increasing this rate [2].

Many university students face several difficulties during the first year, and first-year performance has therefore been identified as an important predictor of timely graduation. In terms of keeping students in the university, the retention rate is a factor that has been studied extensively. Mallinckrodt and Sedlacek (1987) found that the freshman class attrition rate was greater than that of the other academic years, with numbers running up to 30% [3]. Most researchers have therefore targeted first-year students. An early identification of the students at high risk of failing enables educators to intervene in time with the necessary measures, which would increase the graduation rate. Preventing students' failure depends on identifying the factors that affect success.

In this work we analyze whether background information has any effect on the success rate of regular students. The only data we used were collected during the registration period of Epoka University by means of the registration form. The content of this form is determined by the local authorities and the University Administration. We tried to establish whether this data can be used to predict student success. The main objective of our study is to determine the factors that may affect the study outcomes at Epoka University.

2. DATA AND METHODOLOGY
The Epoka University student management system does not provide data in a format ready for direct statistical analysis and modeling. Data preparation and cleaning were therefore undertaken to prepare the database for modeling.

Table. Descriptive statistics: study outcome (716 students)

                        Count          Percent (%)
Domain       Level     FAIL  PASS    FAIL  PASS  Total
GENDER       M          221   189    53.9  46.1   57.3
             F           78   228    25.5  74.5   42.7
COUNTRY      ALB        238   372    39.0  61.0   85.2
             TUR         35    14    71.4  28.6    6.8
             KOS         14    17    45.2  54.8    4.3
             OTH         12    14    46.2  53.8    3.6
NATIONALITY  ALB        256   382    40.1  59.9   89.1
             OTH         43    35    55.1  44.9   10.9
REGION       CITY       262   372    41.3  58.7   88.5
             VILL.       37    44    45.7  54.3   11.3
HS_GPA       UPPER       48   224    17.6  82.4   38.0
             INTER.      89   113    44.1  55.9   28.2
             LOWER      160    77    67.5  32.5   33.1

FAIL and PASS percentages are computed within each level; Total is the level's share of all 716 students.

2.1. Data and Methodology
The outcome used in our analysis is the status of the student at the end of the three-year study. We measured only two outcomes, labeled Pass and Fail. Students labeled 'Pass' successfully completed the program at the end of three years. Students labeled 'Fail' include those who withdrew from the program voluntarily or were withdrawn by the academic registry for not fulfilling the regulations. Students who stayed on the program until the end of the study but scored less than the graduation grade (2.00) were also allocated to this category.

The numeric continuous variable secondary school grade (HS_GPA) was converted into a categorical variable with three levels, A (UPPER), B (INTERMEDIATE) and C (LOWER), denoting grades above 9 out of 10, grades between 8 and 9, and grades below 8 respectively. The other variables (nationality, citizenship, and region) were classified into their major groups.

In this study we applied three main types of data mining approach: a descriptive approach, which concerns the nature of the dataset, such as the frequency table and the relationships between the attributes obtained by cross-tabulation analysis; a predictive approach, which uses four different classification trees; and a comparison between these trees and logistic regression to confirm the accuracy of the predictors. Classification tree models can handle a large number of predictor variables, are non-parametric, and can capture nonlinear relationships and complex interactions between the predictors and the dependent variable [1].

Before generating the classification trees we classified the variables according to the study outcome, i.e. whether students are eligible to graduate or not. We used attribute selection to rank the variables by their importance for further analysis. We then generated the classification trees with four different growing methods.

2.2. Summary Data Description
After cleaning the data, we carried out a cross-tabulation of each variable against the study outcome, as shown in the table above. The table shows that the majority of the successful students are female (228 of 417, about 55%), which results from the fact that 74.5% of the female students successfully completed the study, against 46.1% of the male students. This suggests that female students are more likely to succeed than their male classmates. In terms of country and nationality, Albanian students clearly form the largest group.

An expected result was observed for secondary school degrees: the higher the high school grade level, the higher the university graduation ratio. While 82% of upper-group students completed the study on time, only 56% of intermediate-group and 32% of lower-group students did so.

2.3. Decision Trees
Although the results of the attribute selection suggest continuing the analysis with only a subset of the predictors, we included all available predictors in our classification trees; only two variables were used in the diagrams, HS_GPA and GENDER. Even though some variables may have little significance for the overall prediction outcome, they can be essential to a specific record [1].

Almost all growing methods (CHAID, exhaustive CHAID, CRT and QUEST) generated exactly the same trees. The largest successful group consists of 272 students (38%), whose HS_GPA is over 90%. The largest unsuccessful group contains 237 students (33% of all participants), whose HS_GPA is below 80%. The next largest group also considered unsuccessful consists of male students with lower HS_GPA.

The cross-validation estimate of the risk (0.309) indicates that the model misclassifies 30.9% of the cases, i.e. the risk of misclassifying a student is approximately 31%. This result is consistent with the results in the CHAID classification matrix. The overall percentage shows that the model classified only 70% of students correctly. The classification table, however, reveals one potential problem with this model: it correctly predicts failure for only 65.9% of the unsuccessful students, which means that about 34% of failing students are inaccurately classified with the passing students.

2.4. Logistic regression
The Variables not in the Equation table in block 0 shows that four of the five variables are individually significant predictors of whether a student is successful or not; region is not a significant predictor. The Variables not in the Equation table in block 1 shows that only high school grade point average and gender are significant predictors. This result also confirms why these two were the only variables used in the decision trees.

3. CONCLUSIONS
This study examined the background information in enrolment data that affects the study outcome at Epoka University. The classification accuracy of the classification trees was relatively high, at 71% in all tree methods. Although all the variables except region are individually significant predictors, as described in the attribute selection, the trees used only two variables: gender and secondary school degree. This outcome is also confirmed by the logistic regression. The block 0 classification implied that all variables except region were good predictors (p < .001), but the block 1 classification showed that only gender and secondary school degree were significant.

4. REFERENCES
[1]. Kovačić, Z.J. (2010). Early Prediction of Student Success: Mining Students Enrolment Data. Proceedings of Informing Science & IT Education Conference (InSITE) 2010, Open Polytechnic, Wellington, New Zealand.

[2]. Bratti, M., McKnight, A., Naylor, R., & Smith, J. (2004). Higher Education Outcomes, Graduate Employment and University Performance Indicators. Journal of the Royal Statistical Society, 167(3), pp. 475-496.

[3]. Mallinckrodt, B., & Sedlacek, W. E. (1987). Student retention and the use of campus facilities by race. NASPA Journal, 24, 28-32.
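The HS_GPA recoding and the cross-tabulation percentages described in the data and summary sections can be reproduced with a few lines of Python. This is only an illustrative sketch, not the authors' workflow: the function names are hypothetical, the treatment of boundary grades (exactly 8 or 9) is an assumption because the paper does not specify it, and the counts are read from the descriptive statistics table on the assumption that the first count column is FAIL and the second is PASS (the reading consistent with the 82%, 56% and 32% completion rates quoted in the text). The last function is a simple single-variable majority rule added for comparison; it is not the CHAID, CRT or QUEST tree from the paper.

```python
# Illustrative sketch only; names and boundary handling are assumptions.

def bin_hs_gpa(grade: float) -> str:
    """Map a 0-10 high school grade to the three levels used in the paper.

    Assumption: grades of exactly 8 or 9 are placed in INTER, since the
    paper does not say how boundary values were handled.
    """
    if grade > 9:
        return "UPPER"
    if grade >= 8:
        return "INTER"
    return "LOWER"


# (FAIL count, PASS count) per HS_GPA level, copied from the table.
HS_GPA_COUNTS = {
    "UPPER": (48, 224),
    "INTER": (89, 113),
    "LOWER": (160, 77),
}


def row_percentages(counts):
    """Return (FAIL %, PASS %) within each level, rounded to one decimal."""
    result = {}
    for level, (fail, passed) in counts.items():
        total = fail + passed
        result[level] = (
            round(100 * fail / total, 1),
            round(100 * passed / total, 1),
        )
    return result


def majority_rule_accuracy(counts):
    """Accuracy of predicting each level's majority outcome for all its students."""
    correct = sum(max(fail, passed) for fail, passed in counts.values())
    total = sum(fail + passed for fail, passed in counts.values())
    return correct / total


if __name__ == "__main__":
    for level, (fail_pct, pass_pct) in row_percentages(HS_GPA_COUNTS).items():
        print(f"{level}: FAIL {fail_pct}%  PASS {pass_pct}%")
    print(f"Majority-rule accuracy: {majority_rule_accuracy(HS_GPA_COUNTS):.3f}")
```

Under these assumptions the row percentages match the table values for HS_GPA, and the majority rule on HS_GPA alone classifies 497 of the 711 students with a recorded grade level correctly, which is close to the roughly 70% overall accuracy reported for the trees.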