=Paper=
{{Paper
|id=Vol-3051/UGR_1
|storemode=property
|title=Using Course Evaluations and Student Data to Predict Computer Science Student Success
|pdfUrl=https://ceur-ws.org/Vol-3051/UGR_1.pdf
|volume=Vol-3051
|authors=Anlan Du,Alexandra Plukis,Huzefa Rangwala
|dblpUrl=https://dblp.org/rec/conf/edm/DuPR21
}}
==Using Course Evaluations and Student Data to Predict Computer Science Student Success==
Anlan Du∗ (University of Virginia, Charlottesville, VA, United States; amd5wf@virginia.edu), Alexandra Plukis∗ (Arizona State University, Tempe, AZ, United States; aplukis@asu.edu), Huzefa Rangwala (George Mason University, Fairfax, VA, United States; hrangwal@gmu.edu)

∗ Both authors contributed equally to this research.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT

As the field of computer science has grown, the question of how to improve retention in computer science, especially for females and minorities, has grown increasingly important. Previous research has looked into attitudes among those who leave CS, as well as the impact of taking specific courses; we build on this body of research using large-scale analysis of course evaluations and students' academic history. Our goal is to understand their potential connection to a student's performance and retention within the CS major. We process course-specific data, faculty evaluations, and student demographic data through various machine learning-based classifiers to understand the predictive power of each feature. We find that our algorithms perform significantly better for higher-performing students than for lower-performing ones, but we do not find that evaluations significantly improve predictions of students doing well in courses and staying in the major.

KEYWORDS

educational data mining, course evaluations, computer science, retention, grade prediction, algorithm fairness

1 INTRODUCTION

Among the most important aspects of a college education are the classes a student takes. Often, college students use introductory courses to decide what they would like to study and pursue. A bad experience in an introductory course might detract from a student's first impression of a field, while a good experience might improve his or her opinion, even boosting retention and improving skills upon graduation [13]. It is therefore key that administrators and professors alike understand which course characteristics maintain interest and improve student outcomes. Such information can inform administrative decisions, such as who is assigned to teach particular courses and the recommended sequence of courses.

The digitization of student records and course evaluations offers a unique opportunity to apply big-data modeling techniques to the study of retention. George Mason University (GMU), the data source for this work, keeps anonymized records of students' academic records in high school, demographic data, and their course loads and grades at the university. It also administers standardized course evaluations across all courses. Various data mining and modeling techniques, such as decision trees and support vector machines, can be applied to these datasets and their results compared. Using this data, one can more easily find patterns that reveal how different traits affect student retention.

George Mason also offers a unique opportunity to analyze the impact of professor gender on student success. George Mason's engineering faculty is 26.8% female, more than 1.5 times the national average of 15.7% [10][16]. A larger female faculty means that analyses of the impact of instructor gender are less likely to be swayed by a single professor and are therefore more statistically meaningful.

2 RELATED WORK

Our work builds upon previous research regarding both college retention and achievement in courses, both in general and across demographic groups [7]. Demographic disparities are particularly evident in the number of degrees awarded. For instance, during George Mason University's 2017-2018 school year only 15.8% of the 196 computer science (CS) degrees awarded went to women. This lack of representation is even more pronounced for minority students: only 6 CS degrees were awarded to African American students and 16 to their Hispanic counterparts [9]. These disparities have led to a large body of research into retention for minorities in STEM generally and in CS specifically [8][1][15]. Bettinger and Long researched the impact of female faculty on female retention in majors or repeated interest in classes and found mixed results: some disciplines, such as statistics and mathematics, benefited from an early introduction to a female professor, while others saw a decrease in female retention. The authors pointed out that it was difficult to gauge the exact impact of female professors in fields with low proportions of women among the faculty, such as engineering and physics. We hope to improve upon this because women make up 26.8% of the full-time academic faculty in George Mason's School of Engineering, far surpassing the national average of 15.7% [16][10].

The issue of student performance and retention extends beyond under-represented minorities. Cucuringu et al. used fifteen years of student data to find classes that optimized a student's likelihood of successfully completing a course of study with high grades [5]. They also took the step of segmenting a student population into sub-groups based on various characteristics, so as to understand the nuances that different types of students might experience. Morsy and Karypis used a similarly broad approach to predict student performance based on previous classes taken [14].

Research specific to CS retention has also been conducted: Biggers et al. incorporated interviews of students who left CS, seeking to find the qualitative sentiments that affected both female and male students' decisions [2]. We combine these two approaches by using data on students' individual demographics, grades, and course history to understand how each factor may contribute to both student performance and choice of major. Additionally, we incorporate student evaluations for the courses they take to understand the role that these qualitative elements may play in these outcomes, as suggested by Biggers et al. [2].

Research on course evaluations suggests they may prove informative with regard to a student's academic experience. Much research has studied the relationship between the ease of a course, often represented by the grade a student receives, and the rating of the faculty. One well-known meta-analysis by Cohen argued that students are fairly accurate in their assessments of instructional efficacy [6]. Centra's study built upon this notion and further emphasized that students do not give higher evaluations to professors in a quid pro quo for higher grades: both extremely easy and extremely difficult courses suffered in student evaluations, while courses with appropriate difficulty received the best evaluations [3]. Feldman analyzed the contributory power of various teacher characteristics to a teacher's overall rating and student achievement, finding that preparation, organization, clarity, and students' feelings of engagement contributed most strongly to overall performance [11]. He also highlighted some myths about student evaluations, citing research that suggests that they can, in fact, be informative. We incorporate evaluations in order to expand on these questions of student evaluation efficacy and to understand what they say about students' experiences and choices.
3 PROBLEM DESCRIPTION

The objective of this study is to investigate a few questions relating course quality—defined using faculty traits such as gender and instructional evaluations—to student retention in computer science. Specifically, we address the following questions:

(1) Which course features, if any, in lower-division CS courses improve graduation retention for students?
(2) Which features, if any, of instructors in introductory CS courses can predict student success in future CS courses?
(3) Do non-CS courses that are required of CS majors, like calculus, have an impact on major retention? If so, which courses and features have the largest impact?

4 MATERIALS

4.1 Dataset

Our dataset consisted of records containing first-time freshman student enrollment and course evaluation data for 20,825 George Mason students over the span of eight years, from Summer 2009 to Fall 2018. All student data were collected and anonymized in accordance with GMU's Institutional Review Board policies. The student data contained demographic data such as age, sex, and race; admissions data such as high school, SAT score, and high school GPA; and course data such as declared major, graduation year, courses taken, and grades received. Students who transferred into GMU were not included in this dataset because they likely had completed introductory courses at their previous institutions, rendering that first-year data inaccessible to us.

We also collected course evaluation data on 87,629 GMU courses from Summer 2009 to Spring 2019, 8,243 of which were computer science or computer science-adjacent courses. The evaluations are averages of all of the student evaluations for a specific course and section, so there is one evaluation available for each unique course GMU offers. This data was collected from the GMU evaluation site (https://irr2.gmu.edu/), which is publicly available while on campus. As these are publicly available documents on campus and the identifying features were anonymized, they are exempt research under GMU's IRB policy (https://rdia.gmu.edu/topics-of-interest/human-or-animal-subjects/human-subjects/exempt-research/). To collect data on professor gender, we reviewed pronoun usage in departmental documents and consulted faculty members when documentation was insufficient.

The courses we describe as CS-adjacent are courses taught by or in conjunction with the Department of Computer Science at GMU. These CS-adjacent courses include Information Technology, Computer Game Design, Software Engineering, Electrical and Computer Engineering, and Information Systems. After discarding course records with no grades or with grades that do not translate to the A-F scale, and applying our course filters, we had records for 57,627 student-course enrollments.
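To make the filtering step above concrete, the following is a minimal sketch (not the authors' code) of restricting a raw enrollment table to CS and CS-adjacent records with A-F grades; the DataFrame layout and the subject-code abbreviations are hypothetical assumptions.

    # Illustrative filtering of raw enrollment records, assuming hypothetical
    # "subject" and "grade" columns; the subject codes below are guesses at the
    # CS-adjacent departments named in the text, not GMU's actual codes.
    import pandas as pd

    CS_ADJACENT_SUBJECTS = {"CS", "IT", "GAME", "SWE", "ECE", "IST"}
    AF_GRADES = {"A+", "A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D", "F"}

    def filter_enrollments(enrollments: pd.DataFrame) -> pd.DataFrame:
        # Keep only in-scope courses whose grades map onto the A-F scale.
        in_scope = enrollments["subject"].isin(CS_ADJACENT_SUBJECTS)
        graded = enrollments["grade"].isin(AF_GRADES)
        return enrollments[in_scope & graded]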
4.2 Definitions

We frequently discuss "student success" within the computer science major. In this paper, our definition of "success" is divided into three categories:

Completion of a computer science degree: A student is defined as graduating with a computer science degree if he or she graduated with a major in either computer science or applied computer science. A student is defined as not graduating with a CS degree if he or she graduated, but not with a CS major. Because we are focused on retention, not graduation, we only included in our data students who had had enough time to graduate. By not including students who transfer or drop out of GMU, or who simply have not graduated yet, we reduced the number of confounding variables that are not directly related to students' experiences in CS.

Fulfilment of a student's potential in a course: A student's "potential" in CS211 is defined as the term GPA of the semester in which CS112—the direct prerequisite—was taken. Our interest in this stems from its potential in combination with predictions of passing a course. Students who perform below their "potential" in CS211, despite passing and receiving credit for the course, might still benefit from administrator involvement. Alternatively, the characteristics of students performing above their potential may highlight positive factors that should continue to be promoted at an institutional level.

Passing a course for credit: A student is defined as passing a course for credit if he or she receives a C grade or above. At GMU, computer science BS students "must earn a C or better in any course intended to satisfy a prerequisite for a computer science course ... [s]tudents may attempt an undergraduate course taught by the Volgenau School of Engineering twice" (https://catalog.gmu.edu/colleges-schools/engineering/computer-science/computer-science-bs/#admissionspoliciestext). In our research, we specifically target student success in CS112 and CS211 because they are required courses for CS/ACS majors and prerequisites for all other programming courses. Figure 1 visualizes the contrast in pass rates for first and second attempts at CS211: within our dataset, only 19.8% of students attempting CS211 for the first time did not receive credit, versus 63.3% of students on their second attempt.

[Figure 1: Receiving Credit for CS211.]
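The three success definitions above can be expressed as simple binary labels. The sketch below is illustrative only: the per-student DataFrame and its column names (final_major, cs211_grade_points, cs112_term_gpa) are assumptions, and grade points are taken to be on the usual 4.0 scale where a C is 2.0.

    import pandas as pd

    def make_labels(students: pd.DataFrame) -> pd.DataFrame:
        labels = pd.DataFrame(index=students.index)
        # Completion: graduated with a CS or Applied CS (ACS) major.
        labels["graduated_cs"] = students["final_major"].isin(["CS", "ACS"]).astype(int)
        # Passing CS211 for credit: a C (2.0 grade points) or better.
        labels["passed_cs211"] = (students["cs211_grade_points"] >= 2.0).astype(int)
        # Potential: the term GPA of the semester in which CS112 was taken is the
        # forecast; a CS211 grade below that forecast marks the student "at risk".
        labels["below_potential"] = (
            students["cs211_grade_points"] < students["cs112_term_gpa"]
        ).astype(int)
        return labels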
5 METHODS

For this work, we compared the performance of predictive models trained on three different sets of data, which are fully described in Appendix A:

(1) Baseline predictions based on high school performance and student demographic data, as well as basic course information such as the term in which a course was taken and a student's GPA in that term.
(2) Baseline features in addition to instructor gender and course evaluations for the classes—either CS only, or math and CS—taken by each student.
(3) Baseline features, plus course numbers as unique identifiers that are distinct for each section and semester of a class, but common to all of the students who took that section.

We chose to use machine learning classifiers because they can often pick up on more intricate patterns and correlations than linear and other basic statistical models can. We tested these distinct data sets because they each highlight a component of students' courses that may be significant to their performance and ultimate retention. The full list of features used in each experiment is given in Appendix A.

We used seven classifiers from the Python scikit-learn library: Random Forest, Gradient Boosting, AdaBoost, SVC, Decision Tree, Neural Net, and Naive Bayes. For each of these models, we performed 5-fold cross-validation, recording the resulting averages and standard deviations. In order to account for imbalances in our dataset, we decided upon area under the ROC curve (ROC AUC) and F1 score as our main metrics, because they take precision and recall into account in addition to overall accuracy.
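A minimal sketch of this evaluation loop using scikit-learn is shown below; it assumes a feature matrix X and binary label vector y have already been assembled, and it uses default hyperparameters rather than the authors' exact settings.

    import numpy as np
    from sklearn.model_selection import cross_validate, StratifiedKFold
    from sklearn.ensemble import (RandomForestClassifier, GradientBoostingClassifier,
                                  AdaBoostClassifier)
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neural_network import MLPClassifier
    from sklearn.naive_bayes import GaussianNB

    CLASSIFIERS = {
        "Random Forest": RandomForestClassifier(random_state=0),
        "Gradient Boosting": GradientBoostingClassifier(random_state=0),
        "AdaBoost": AdaBoostClassifier(random_state=0),
        "SVC": SVC(probability=True, random_state=0),
        "Decision Tree": DecisionTreeClassifier(random_state=0),
        "Neural Net": MLPClassifier(max_iter=1000, random_state=0),
        "Naive Bayes": GaussianNB(),
    }

    def evaluate(X, y, seed=0):
        """5-fold CV with a deterministic seed; mean and std of F1, ROC AUC, accuracy."""
        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
        results = {}
        for name, clf in CLASSIFIERS.items():
            scores = cross_validate(clf, X, y, cv=cv,
                                    scoring=["f1", "roc_auc", "accuracy"])
            results[name] = {m: (np.mean(scores[f"test_{m}"]), np.std(scores[f"test_{m}"]))
                             for m in ["f1", "roc_auc", "accuracy"]}
        return results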
5.1 Pre-Processing

We consolidated student data for all students who took at least one CS class, of whom there were 15,552. To better incorporate summer student data, we moved summer courses to the following fall term. Then, we calculated percentile values for students' SAT scores and high school GPAs, enabling us to compare these metrics along a standard scale of 0 to 1. Next, for models predicting retention, we removed all students who had not yet graduated, leaving us with 7,602 students. Lastly, we dropped all students with empty values for any of the columns used in training. This left a dataset of 1,476 students who took at least one CS or CS-adjacent class before graduating. Of those, 330 (22.35%) graduated with a CS or ACS major. This left us with an imbalanced dataset, leading to our decision to use F1 score and ROC AUC to characterize our models.

For the grade prediction portion, students who received no grade—meaning they audited or did not complete the class—were not included in the data. This left 1,728 students who took both CS112 and CS211 at GMU at least once. In cases where students took these courses multiple times, only the initial attempt was used, so as to capture only their original experience in the class. Predicting grades for only first attempts at CS211 also offers an earlier flagging system for at-risk students.

We wanted to understand the impact that not only general instructor qualities, but also "exemplary" instructors, had on student grades. To that end, each grade prediction model was run with the course evaluations processed in one of two ways: percentiles or flags. Percentiles, which capture the general quality of an instructor, transformed each evaluation entry into a percentile relative to the other courses. Flags, which served to identify exemplary instructors, transformed each entry into a binary feature based on whether it was in the top 10% of evaluation scores in that category.

Although evaluations offer more data than can usually be gleaned from student records, we also tried to capture the elements of a course that cannot be captured in evaluations or records. We did so by creating unique course IDs for each course, so as to highlight especially good courses, good times of day for students, and good connections between students in courses—all of which are not explicitly quantified in our data.

5.2 Experiments

As mentioned previously, we had three main groups of datasets. The second group, which includes the course evaluation data, was run on three different subsets: first, it was trained with just the "overall teaching" and "overall course" evaluation scores for the first CS and math courses; then with the overall evaluations from the first two courses; and then with all available course evaluation metrics for the first two courses in each area. For graduation prediction, both math and CS courses were included in the evaluation data in order to capture a full snapshot of introductory courses. For grade prediction, only CS courses were included, so as not to diminish the dataset with students outside CS or STEM, who often do not have the same rigorous math requirements.

Our rationale in deploying some tests with just two course evaluation features per course was that the added dimensionality of running the models on all of the features (many of which were positively correlated) might hinder performance. The baseline was meant to serve as a control for the predictive capability of only basic course features and student demographic information, so that subsequent tests might reveal how much predictive power the additional data added. The full list of features used for each of these experiments is given in Appendix A.

All of our experiments deal with binary classification, and as such require binary flagging of the classes of interest. In grade prediction experiments, those who are at risk—of either not receiving credit or not fulfilling their potential in a course—are flagged with a 1. In the graduation predictions, students who graduate with a computer science or applied computer science degree are flagged with a 1.

For each experiment, we ran 5-fold cross-validation on our models, using a deterministic seed to generate our training-testing splits so that we could directly compare splits before and after the models were trained. We performed Student's t-tests on our results to understand the significance of any differences in performance.
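The two evaluation treatments described in the pre-processing above (percentiles versus top-10% flags) amount to simple column-wise transformations. A hedged sketch follows; the DataFrame of per-section evaluation averages and its layout are hypothetical.

    import pandas as pd

    def to_percentiles(evals: pd.DataFrame) -> pd.DataFrame:
        # Rank each evaluation item against all other course sections,
        # yielding values between 0 and 1.
        return evals.rank(pct=True)

    def to_flags(evals: pd.DataFrame, top_fraction: float = 0.10) -> pd.DataFrame:
        # Flag a section as "exemplary" on an item if its average score falls
        # in the top 10% of all sections for that item.
        thresholds = evals.quantile(1.0 - top_fraction)
        return (evals >= thresholds).astype(int)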
To test the Naive 0.488 0.783 0.719 0.029 0.683 0.383 fairness implications, 5-fold splits were trained on all students and Bayes ±0.217 ±0.035 ±0.024 ±0.026 ±0.034 ±0.009 then tested only on certain quartiles. This way, we could clearly Table 1: Predicting CS211 success—passing the class or see any disparity in performance for all students versus those in achieving one’s "potential" grade—from only student demo- separate groups of students. graphics and basic course features. Graduating Classifier F1 AUC Acc Gradient 0.533 0.855 0.824 Boosting ±0.037 ±0.007 ±0.011 Random 0.447 0.837 0.825 Forest ±0.041 ±0.009 ±0.011 AdaBoost 0.563 0.839 0.823 ±0.047 ±0.028 ±0.023 Decision 0.473 0.662 0.759 Figure 2: High School GPA versus SAT Total score for all non- Tree ±0.018 ±0.012 ±0.015 transfer students who took both CS112 and CS211 at GMU. Neural 0.486 0.762 0.735 ±0.036 ±0.018 ±0.043 SVC 0.0 0.770 0.776 We used these quartiles to test for fairness by training each of ±0.0 ±0.030 ±0.000 our models on the full datasets, splitting up the testing sets based Naive 0.460 0.785 0.515 on the quartiles, and calculating the metrics based on these results. Bayes ±0.009 ±0.029 ±0.018 We then compared these quartile results with the results for all Table 2: Predicting a CS211 success measure—graduating students to determine if there was a significant difference between with a CS degree—from only student demographics and ba- them, and therefore a disparity in fairness for differing groups. sic course features. 6 RESULTS Our results are divided into three sections: (1) Performance metrics (F1 Score, ROC AUC, Accuracy) for our 6.2 Effect of Including Evaluation Data baseline models; Tables 3, 4, and 5 assess the difference in performance between the (2) Comparison between baseline models and models that in- baseline models and those that incorporated evaluation and course clude course evaluation and other instructor data; data. The smallest p-values are in bold. 
6 RESULTS

Our results are divided into three sections:

(1) Performance metrics (F1 score, ROC AUC, accuracy) for our baseline models;
(2) Comparison between baseline models and models that include course evaluation and other instructor data;
(3) Fairness: comparison between prediction for each academic quartile versus prediction for all students.

6.1 Baseline Performance

Tables 1 and 2 show the baseline ability of each machine learning model to predict student success without any course evaluation data. These models were trained and tested on only basic course features, such as the term taken and the number of students in the class, and student demographics.

Table 1: Predicting CS211 success—passing the class or achieving one's "potential" grade—from only student demographics and basic course features.

Classifier        | Passing F1    | Passing AUC   | Passing Acc   | Potential F1  | Potential AUC | Potential Acc
Gradient Boosting | 0.666 ±0.053  | 0.869 ±0.034  | 0.814 ±0.029  | 0.790 ±0.014  | 0.770 ±0.012  | 0.721 ±0.014
Random Forest     | 0.640 ±0.046  | 0.865 ±0.030  | 0.808 ±0.019  | 0.800 ±0.012  | 0.776 ±0.016  | 0.730 ±0.015
AdaBoost          | 0.641 ±0.051  | 0.849 ±0.035  | 0.803 ±0.021  | 0.777 ±0.024  | 0.746 ±0.025  | 0.710 ±0.022
Decision Tree     | 0.637 ±0.050  | 0.825 ±0.034  | 0.798 ±0.020  | 0.787 ±0.021  | 0.737 ±0.024  | 0.710 ±0.024
Neural Net        | 0.623 ±0.064  | 0.853 ±0.032  | 0.795 ±0.032  | 0.779 ±0.007  | 0.764 ±0.027  | 0.712 ±0.010
SVC               | 0.598 ±0.049  | 0.835 ±0.037  | 0.794 ±0.021  | 0.782 ±0.009  | 0.743 ±0.032  | 0.697 ±0.014
Naive Bayes       | 0.488 ±0.217  | 0.783 ±0.035  | 0.719 ±0.024  | 0.029 ±0.026  | 0.683 ±0.034  | 0.383 ±0.009

Table 2: Predicting a CS211 success measure—graduating with a CS degree—from only student demographics and basic course features.

Classifier        | F1            | AUC           | Acc
Gradient Boosting | 0.533 ±0.037  | 0.855 ±0.007  | 0.824 ±0.011
Random Forest     | 0.447 ±0.041  | 0.837 ±0.009  | 0.825 ±0.011
AdaBoost          | 0.563 ±0.047  | 0.839 ±0.028  | 0.823 ±0.023
Decision Tree     | 0.473 ±0.018  | 0.662 ±0.012  | 0.759 ±0.015
Neural Net        | 0.486 ±0.036  | 0.762 ±0.018  | 0.735 ±0.043
SVC               | 0.0 ±0.0      | 0.770 ±0.030  | 0.776 ±0.000
Naive Bayes       | 0.460 ±0.009  | 0.785 ±0.029  | 0.515 ±0.018

6.2 Effect of Including Evaluation Data

Tables 3, 4, and 5 assess the difference in performance between the baseline models and those that incorporated evaluation and course data. The smallest p-values are in bold. The Percentiles columns indicate that evaluation scores were converted to percentiles; the Flags columns indicate that binary flags for the top 10% of scores were used. Note that for models using Discrete IDs, we do not use numerical evaluation data, so there is no distinction between the two treatments' results.

Table 3: Experimental models' performance in predicting whether students passed CS211, versus baseline models.

Group | Experiment      | Percentiles t | Percentiles p | Flags t  | Flags p
All   | 1 Overall Eval  | -0.0319 | 0.9754 |  0.6862 | 0.5120
All   | 2 Overall Evals |  0.6796 | 0.5176 |  0.5370 | 0.6059
All   | 2 Full Evals    |  0.1160 | 0.9105 |  0.1773 | 0.8637
All   | Discrete IDs    |  0.3738 | 0.7191 |  0.3738 | 0.7191
Q1    | 1 Overall Eval  |  0.1700 | 0.8696 | -0.1004 | 0.9227
Q1    | 2 Overall Evals | -0.3412 | 0.7425 |  0.4117 | 0.6930
Q1    | 2 Full Evals    | -0.0476 | 0.9632 |  0.0834 | 0.9357
Q1    | Discrete IDs    | -0.0594 | 0.9540 | -0.0594 | 0.9541
Q2    | 1 Overall Eval  | -0.2578 | 0.8034 | -0.2765 | 0.7894
Q2    | 2 Overall Evals |  0.01077 | 0.9917 | -0.3237 | 0.7549
Q2    | 2 Full Evals    | -0.1091 | 0.9161 | -0.2489 | 0.8097
Q2    | Discrete IDs    |  0.0749 | 0.9421 |  0.0749 | 0.9421
Q3    | 1 Overall Eval  | -0.3398 | 0.7430 | -0.1586 | 0.8784
Q3    | 2 Overall Evals |  0.0113 | 0.9913 | -0.0904 | 0.9302
Q3    | 2 Full Evals    |  0.0334 | 0.9742 | -0.0587 | 0.9551
Q3    | Discrete IDs    |  0.4269 | 0.6840 |  0.4269 | 0.6840
Q4    | 1 Overall Eval  | -0.6431 | 0.5382 | -1.1410 | 0.2892
Q4    | 2 Overall Evals |  0.6485 | 0.5379 |  0.5716 | 0.5864
Q4    | 2 Full Evals    |  0.1852 | 0.8585 |  0.5546 | 0.5943
Q4    | Discrete IDs    | -0.3446 | 0.7404 | -0.3446 | 0.7404

Table 4: Experimental models' performance in predicting whether students achieved their "potential" grade in CS211, versus baseline models.

Group | Experiment      | Percentiles t | Percentiles p | Flags t  | Flags p
All   | 1 Overall Eval  | -0.1432 | 0.8898 |  0.5000 | 0.6352
All   | 2 Overall Evals |  1.1452 | 0.2863 | -0.5726 | 0.5831
All   | 2 Full Evals    |  0.4472 | 0.6675 | -0.6202 | 0.5549
All   | Discrete IDs    |  1.5110 | 0.1695 |  1.5110 | 0.1695
Q1    | 1 Overall Eval  | -0.0160 | 0.9877 | -0.1531 | 0.8822
Q1    | 2 Overall Evals |  0.2992 | 0.7728 |  0.0360 | 0.9722
Q1    | 2 Full Evals    | -0.2371 | 0.8186 |  0.0559 | 0.9569
Q1    | Discrete IDs    |  0.3335 | 0.7475 |  0.3335 | 0.7475
Q2    | 1 Overall Eval  |  0.2996 | 0.7724 | -0.0688 | 0.9469
Q2    | 2 Overall Evals |  0.7417 | 0.4804 | -0.1990 | 0.8473
Q2    | 2 Full Evals    |  0.1707 | 0.8688 |  0.2541 | 0.8061
Q2    | Discrete IDs    |  1.1216 | 0.2947 |  1.1216 | 0.2947
Q3    | 1 Overall Eval  |  0.1039 | 0.9199 |  0.1922 | 0.8526
Q3    | 2 Overall Evals |  0.5626 | 0.5894 |  0.1778 | 0.8637
Q3    | 2 Full Evals    |  0.7600 | 0.4716 | -0.2082 | 0.8403
Q3    | Discrete IDs    |  1.1218 | 0.2946 |  1.1218 | 0.2946
Q4    | 1 Overall Eval  | -0.3295 | 0.7502 | -0.4261 | 0.6815
Q4    | 2 Overall Evals |  0.2247 | 0.8279 | -0.1384 | 0.8936
Q4    | 2 Full Evals    |  0.3297 | 0.7512 | -0.5162 | 0.6203
Q4    | Discrete IDs    |  0.2808 | 0.7862 |  0.2808 | 0.7862

Table 5: Experimental models' performance in predicting retention in the CS major, versus baseline models.

Group | Experiment      | t Statistic | p-value
All   | 1 Overall Eval  | -0.6515 | 0.5334
All   | 2 Overall Evals | -0.7630 | 0.4781
All   | 2 Full Evals    |  0.2207 | 0.8318
All   | Discrete IDs    | -0.5975 | 0.5669
Q1    | 1 Overall Eval  |  0.2620 | 0.8008
Q1    | 2 Overall Evals |  0.2671 | 0.7964
Q1    | 2 Full Evals    |  0.4025 | 0.6982
Q1    | Discrete IDs    |  0.3470 | 0.7385
Q2    | 1 Overall Eval  | -0.5466 | 0.5996
Q2    | 2 Overall Evals | -0.7459 | 0.4787
Q2    | 2 Full Evals    | -0.3600 | 0.7284
Q2    | Discrete IDs    | -0.1730 | 0.8670
Q3    | 1 Overall Eval  | -0.7931 | 0.4557
Q3    | 2 Overall Evals |  0.0139 | 0.9893
Q3    | 2 Full Evals    | -0.1678 | 0.8714
Q3    | Discrete IDs    |  0.2165 | 0.8352
Q4    | 1 Overall Eval  | -0.2212 | 0.8311
Q4    | 2 Overall Evals |  0.4312 | 0.6778
Q4    | 2 Full Evals    |  0.5631 | 0.5889
Q4    | Discrete IDs    | -0.5019 | 0.6293

In all of these t-tests, our null hypothesis was that evaluations and the specific courses taken by a student do not improve student-success predictions. If this were true, results from the baseline set of data would be the same as results that included course information, because the course information would add no predictive power. None of our experiments showed a significant improvement over our baseline, so we fail to reject our null hypothesis and do not find that evaluations improve predictions of student success.
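The baseline-versus-experiment comparison above boils down to a Student's t-test over fold-wise scores. The sketch below uses placeholder F1 values, not the paper's results.

    from scipy import stats

    baseline_f1 = [0.66, 0.61, 0.65, 0.63, 0.64]      # hypothetical 5-fold F1 scores
    with_evals_f1 = [0.67, 0.62, 0.64, 0.65, 0.63]    # hypothetical 5-fold F1 scores

    t_stat, p_value = stats.ttest_ind(with_evals_f1, baseline_f1)
    print(f"t = {t_stat:.4f}, p = {p_value:.4f}")
    # A positive t with p < 0.05 would indicate that the evaluation features
    # significantly improved F1 over the baseline; the paper reports no such case.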
6.3 Fairness Across Student Quartiles

Tables 6, 7, and 8 show the fairness t-tests. These are tests of whether the performance of each experimental model is better or worse at predicting results for a specific quartile, versus predicting results for all students. They capture the statistical significance of discrepancies in performance when run on different groups of students. The null hypothesis in these tests is that there is no difference between the F1 scores for all students and those of each quartile. In other words: the null hypothesis is that the predictions are fair. The lowest p-values we found are in bold or, if they are statistically significant, are highlighted.

Table 6 shows the models' fairness in predicting whether students passed CS211.

Table 6: Fairness in experimental models' predictions of whether students passed CS211.

Group | Model           | Percentiles t | Percentiles p | Flags t  | Flags p
Q1    | Baseline        |  1.1476 | 0.2884 |  1.1476 | 0.2884
Q1    | 1 Overall Eval  |  0.9143 | 0.3899 |  0.6463 | 0.5443
Q1    | 2 Overall Evals |  1.3646 | 0.2097 |  1.2095 | 0.2610
Q1    | 2 Full Evals    |  1.1995 | 0.2669 |  1.4597 | 0.1896
Q1    | Discrete IDs    |  1.0130 | 0.3471 |  1.0130 | 0.3471
Q2    | Baseline        | -0.2019 | 0.8475 | -0.2019 | 0.8475
Q2    | 1 Overall Eval  | -0.4798 | 0.6519 | -0.2687 | 0.7957
Q2    | 2 Overall Evals | -0.3005 | 0.7746 | -0.3801 | 0.7179
Q2    | 2 Full Evals    | -0.1890 | 0.8568 | -0.4047 | 0.6965
Q2    | Discrete IDs    |  0.3129 | 0.7640 |  0.3129 | 0.7640
Q3    | Baseline        |  0.4792 | 0.6448 |  0.4792 | 0.6448
Q3    | 1 Overall Eval  |  0.3567 | 0.7316 |  0.3720 | 0.7197
Q3    | 2 Overall Evals |  0.2940 | 0.7772 | -0.0392 | 0.9700
Q3    | 2 Full Evals    |  0.5200 | 0.6172 |  0.2183 | 0.8327
Q3    | Discrete IDs    |  0.5267 | 0.6133 |  0.5267 | 0.6133
Q4    | Baseline        | -0.6228 | 0.5560 | -0.6228 | 0.5560
Q4    | 1 Overall Eval  | -0.3376 | 0.7453 | -0.6068 | 0.5612
Q4    | 2 Overall Evals | -0.3570 | 0.730423 | -1.3819 | 0.2078
Q4    | 2 Full Evals    | -0.6432 | 0.5467 | -0.5803 | 0.5793
Q4    | Discrete IDs    | -0.7653 | 0.4773 | -0.7653 | 0.4773

Table 7 shows fairness in predicting whether students achieved their potential grades. This table differs much from Table 6 in that many of the p-values listed here are significant at the 0.05 level. All of the significant results are clustered within the first and second quartiles, which are the bottom two quartiles in our groupings.
Table 7: Fairness in experimental models' predictions of whether students achieved their "potential" grade in CS211.

Group | Model           | Percentiles t | Percentiles p | Flags t  | Flags p
Q1    | Baseline        |  2.5055 | 0.0557 |  2.5055 | 0.0557
Q1    | 1 Overall Eval  |  1.9566 | 0.1005 |  1.7817 | 0.1338
Q1    | 2 Overall Evals |  1.9229 | 0.1069 |  2.9422 | 0.0278
Q1    | 2 Full Evals    |  2.0311 | 0.0960 |  2.540979 | 0.0462
Q1    | Discrete IDs    |  2.0484 | 0.0858 |  2.0484 | 0.0858
Q2    | Baseline        |  3.1270 | 0.0267 |  3.1270 | 0.0267
Q2    | 1 Overall Eval  |  2.6275 | 0.0445 |  2.8256 | 0.0371
Q2    | 2 Overall Evals |  3.2574 | 0.0162 |  4.4184 | 0.0045
Q2    | 2 Full Evals    |  4.5984 | 0.0019 |  3.2834 | 0.0196
Q2    | Discrete IDs    |  3.5205 | 0.0134 |  3.5205 | 0.0135
Q3    | Baseline        |  0.0994 | 0.9246 |  0.0994 | 0.9246
Q3    | 1 Overall Eval  |  0.4097 | 0.6943 |  0.3179 | 0.7602
Q3    | 2 Overall Evals |  0.3143 | 0.7647 | -0.1453 | 0.8895
Q3    | 2 Full Evals    |  0.2490 | 0.8111 |  0.6658 | 0.5346
Q3    | Discrete IDs    |  0.3603 | 0.7305 |  0.3603 | 0.7305
Q4    | Baseline        | -2.0964 | 0.1004 | -2.0964 | 0.1004
Q4    | 1 Overall Eval  | -2.4168 | 0.0650 | -2.4568 | 0.0627
Q4    | 2 Overall Evals | -2.5259 | 0.0556 | -2.3934 | 0.0695
Q4    | 2 Full Evals    | -2.1430 | 0.0907 | -2.3262 | 0.0736
Q4    | Discrete IDs    | -2.5112 | 0.0585 | -2.5112 | 0.0585

Table 8 shows the models' fairness in predicting whether students graduate with a CS major. While the significant t statistics in Table 7 were positive—indicating that the models perform best on the first and second quartiles—here the performance for the lower two quartiles is negative. Additionally, the t statistics are significantly better for students in the top quartile. This suggests significant fairness disparities in these prediction models: the F1 scores of prediction across the entire student body overlap strongly with the F1 scores for the third quartile, but vary widely from those of both stronger and poorer overall performers. This is in spite of the quartiles being represented in the overall dataset in equal numbers of data points.

Table 8: Fairness in experimental models' predictions of whether students graduated with a CS major.

Group | Model           | t Statistic | p-value
Q1    | Baseline        | -5.7063 | 0.0019
Q1    | 1 Overall Eval  | -3.4333 | 0.0195
Q1    | 2 Overall Evals | -4.4204 | 0.0106
Q1    | 2 Full Evals    | -6.6634 | 0.0013
Q1    | Discrete IDs    | -3.6290 | 0.0178
Q2    | Baseline        | -2.5044 | 0.0528
Q2    | 1 Overall Eval  | -2.7323 | 0.0380
Q2    | 2 Overall Evals | -4.3013 | 0.0106
Q2    | 2 Full Evals    | -3.6916 | 0.0161
Q2    | Discrete IDs    | -2.7675 | 0.0381
Q3    | Baseline        | -0.6357 | 0.5498
Q3    | 1 Overall Eval  | -1.4485 | 0.1859
Q3    | 2 Overall Evals | -0.4967 | 0.6411
Q3    | 2 Full Evals    | -1.3347 | 0.2360
Q3    | Discrete IDs    | -0.1085 | 0.9166
Q4    | Baseline        |  2.4488 | 0.0421
Q4    | 1 Overall Eval  |  3.1471 | 0.0144
Q4    | 2 Overall Evals |  4.4240 | 0.0072
Q4    | 2 Full Evals    |  3.1356 | 0.0225
Q4    | Discrete IDs    |  2.4960 | 0.0409

7 DISCUSSION

7.1 Improvements Upon the Baseline

Overall, our results show that adding evaluations to predictions does not significantly improve performance over the baseline of student demographics and basic course features. The addition of instructor gender, too, was not significant. Even when gender was included for all courses used in the predictions, the results did not improve drastically. However, there were many semesters in which no female instructors were available to teach a course at all, so there could be no direct comparisons between students with male instructors and those with female instructors. Although there are slight improvements over the baseline for some experiments, notably those involving the discrete and continuous unique IDs, they do not reach a significance level of 0.05. This suggests that any impact of student evaluation data and instructor gender on student performance is not immediately visible.

Generally, our experiments performed better when predicting whether students would achieve their "potential" grades than when predicting whether they would receive credit. When comparing p-values between passing and potential for all experiments, as in Tables 4, 5, and 6, predicting student potential seems to improve upon predicting passing even when compared to their respective baselines. We attribute this to the fluid nature of a student's forecast grade: if a student's forecast grade is a C, then there are three possible grades that the student could receive and still perform at or above his or her potential. Similarly, for students whose predicted grades are A's, there is only one possible grade, an A, with which they can achieve at or above their potential. This imbalance on both sides of the forecast grade means the models can more easily predict achieving below a potential grade, because there are generally more options on the lower end of the grade scale than on the higher end.
In addition, because the evaluations we have access to are only averages for all students in a course and do not reflect each student's personal evaluation of the professor, each student in a section of a course would have the same evaluations. This large amount of overlap between students, who then experienced different outcomes, seemed to negatively impact the models in experiments where full sets of evaluations were used. This issue was slightly assuaged with the use of unique IDs, but not at a significant level for most quartiles (see Tables 5, 6, and 7). For this reason, evaluation sets in which evaluations are unique to each student would provide an interesting contrast to this work—individualized evaluations might provide high-quality features for prediction.

7.2 Fairness

Our fairness results reach significance levels especially often in the third and fourth quartiles—see Tables 8, 9, and 10—which are the lower academic quartiles. These quartile models underperform against the models for all students, frequently significantly. This is a cause for concern: the students for which our models predict well, quartiles 1 and 2, are the quartiles in which students often already perform well. There are a few reasons that might contribute to this underperformance on the lower quartiles of students. One is that our quartiles are artificially created—although high school GPA and SAT scores are indicators of academic success in high school, they do not necessarily represent the same success in college. In addition, we split students into quartiles depending on the percentile of their averaged HS GPA and SAT scores, not on any visible clusters within the data. These artificial clusters might not represent true student groups.

8 CONCLUSION

Our data suggests there is a pressing need to understand how students of different academic calibers experience the same curricula, given the disparities in their ultimate outcomes. It also suggests that evaluations of courses, at least as they are structured in our data, do not offer significant insight into how a student will perform or whether he or she will remain in computer science. Lastly, we find that the starting course number or code may have some predictive power, suggesting that different courses may significantly impact the outcomes of students. The question now becomes one of identifying how to measure the different features of these courses.

There are several possible expansions of our methodology. As previously mentioned, our data is imbalanced, and using techniques such as cost-aware training, oversampling, undersampling, or the Synthetic Minority Oversampling Technique (SMOTE) [4] might enable a more balanced weighting of results and greater accuracy in identifying points in the minority classes. Another proposed fairness-oriented metric is the Absolute Between-ROC Area metric, which measures the absolute area between two ROC curves; in doing so, it measures disparities in prediction across every possible decision threshold, as opposed to just one [12]. Lastly, we would like to grid-search for hyperparameters that optimize F1 score and ROC AUC, rather than accuracy.
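As one example of the imbalance-handling expansions suggested above, the following sketch applies SMOTE from the third-party imbalanced-learn package to a synthetic stand-in for the roughly 22%-positive retention dataset; it is illustrative only and not part of the study.

    from imblearn.over_sampling import SMOTE            # third-party: imbalanced-learn
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier

    # Synthetic stand-in for a dataset with ~22% positive labels.
    X, y = make_classification(n_samples=1000, weights=[0.78, 0.22], random_state=0)

    # Oversample the minority class on the training data only, then fit a model;
    # test folds would keep their original balance so F1/ROC AUC stay comparable.
    X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
    clf = GradientBoostingClassifier(random_state=0).fit(X_res, y_res)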
In addition to course evaluations for computer science classes, we also scraped course evaluations for other classes. In the future, we hope to use this dataset to apply such retention analysis to all majors. Given that GMU has a unique student body, with many transfer students and non-traditional graduates, we would also like to include these students in a future analysis to track differences in their progressions through their majors. This also begs the question of whether our results would be different at a school with more four-year students. Although our data does not indicate that evaluations can improve predictions of student success, we are interested in the outcomes of research into this avenue at schools with differing evaluation styles, to see if these results can be improved upon. In addition, the fairness concerns raised in this paper—differing performance for students of varying academic standing—remain to be addressed. We would like to see the improvement of grade prediction techniques both for all students and for each quartile or minority demographic.

ACKNOWLEDGMENTS

This work was funded by the National Science Foundation as part of GMU's Computer Science Research Experience for Undergraduates, and graciously supported by Dr. Karen Lee and GMU's OSCAR office. We would also like to thank Drs. Mark Snyder and Huzefa Rangwala for their guidance and knowledge. Finally, we thank GMU for the use of its facilities and for this opportunity.

REFERENCES

[1] E. P. Bettinger and B. T. Long. 2005. Do Faculty Serve as Role Models? The Impact of Instructor Gender on Female Students. The American Economic Review 95, 2 (2005), 152–157. http://www.jstor.org/stable/4132808
[2] Maureen Biggers, Anne Brauer, and Tuba Yilmaz. 2008. Student Perceptions of Computer Science: A Retention Study Comparing Graduating Seniors with CS Leavers. SIGCSE Bull. 40, 1 (March 2008), 402–406. https://doi.org/10.1145/1352322.1352274
[3] John A. Centra. 2003. Will Teachers Receive Higher Student Evaluations by Giving Higher Grades and Less Course Work? Research in Higher Education 44, 5 (2003), 495–518. https://doi.org/10.1023/A:1025492407752
[4] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. 2002. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research 16 (2002), 321–357.
[5] Chantal Cherifi, Hocine Cherifi, Márton Karsai, and Mirco Musolesi (Eds.). 2018. Complex Networks & Their Applications VI: Proceedings of Complex Networks 2017 (The Sixth International Conference on Complex Networks and Their Applications). Studies in Computational Intelligence, Vol. 689. Springer International Publishing, Cham. https://doi.org/10.1007/978-3-319-72150-7
[6] Peter A. Cohen. 1981. Student Ratings of Instruction and Student Achievement: A Meta-Analysis of Multisection Validity Studies. Review of Educational Research 51, 3 (1981), 281–309. https://doi.org/10.2307/1170209
[7] Paulo Cortez and Alice Silva. 2008. Using Data Mining to Predict Secondary School Student Performance. EUROSIS (January 2008).
[8] Benjamin J. Drury, John Oliver Siy, and Sapna Cheryan. 2011. When Do Female Role Models Benefit Women? The Importance of Differentiating Recruitment From Retention in STEM. Psychological Inquiry 22, 4 (2011), 265–269. https://doi.org/10.1080/1047840X.2011.620935
[9] Office of Institutional Effectiveness and Planning. 2019. Degrees Conferred By Degree And Demographic - Year 2017-18, All Terms. http://irr2.gmu.edu/New/N_Degree/DegDegreeDetail.cfm
[10] Office of Institutional Effectiveness and Planning. 2019. Full-Time Academic Faculty Demographic Profiles Two-Year Comparisons. http://irr2.gmu.edu/New/N_Faculty/FullTimeFacComp.cfm
[11] Kenneth A. Feldman. 1996. Identifying Exemplary Teaching: Using Data from Course and Teacher Evaluations. New Directions for Teaching and Learning 1996, 65 (March 1996), 41–50. https://doi.org/10.1002/tl.37219966509
[12] Josh Gardner, Christopher Brooks, and Ryan Baker. 2019. Evaluating the Fairness of Predictive Student Models Through Slicing Analysis. In Proceedings of the 9th International Conference on Learning Analytics & Knowledge (LAK19). ACM, New York, NY, USA, 225–234. https://doi.org/10.1145/3303772.3303791
[13] Gregory Warren Bucks, Kathleen A. Ossman, Jeff Kastner, and F. James Boerio. 2015. First-year Engineering Courses' Effect on Retention and Workplace Performance. In 2015 ASEE Annual Conference & Exposition. ASEE Conferences, Seattle, Washington. https://peer.asee.org/24114
[14] Sara Morsy and George Karypis. 2017. Cumulative Knowledge-based Regression Models for Next-term Grade Prediction. In Proceedings of the 17th SIAM International Conference on Data Mining (SDM 2017), 552–560. http://www.scopus.com/inward/record.url?scp=85027876583&partnerID=8YFLogxK
[15] Laird Townsend. 1994. How Universities Successfully Retain and Graduate Black Students. The Journal of Blacks in Higher Education 4 (1994), 85–89. http://www.jstor.org/stable/2963380
[16] Brian L. Yoder. 2017. Engineering by the Numbers. Engineering College Profiles & Statistics Book (2017). http://www.asee.org/papers-and-publications/publications/college-profiles/15EngineeringbytheNumbersPart1.pdf

A PREDICTION FEATURES

Table 9 shows the features used for each of the predictions, categorized by the type of experiment being described.

Table 9: The features used for each experiment.

Feature             | Meaning                                                                                              | Baseline | Overall SET | All SET | IDs
Race                | Categorical variable; includes an option for no race listed.                                         | ✕ | ✕ | ✕ | ✕
Sex                 | Categorical variable: male, female, and no gender listed.                                            | ✕ | ✕ | ✕ | ✕
High School GPA     | Continuous variable.                                                                                 | ✕ | ✕ | ✕ | ✕
SAT Total Score     | Continuous variable, out of 1600. Empty cells are filled in with the median of the total SAT scores. | ✕ | ✕ | ✕ | ✕
SAT Verbal Score    | Continuous variable, out of 800. Empty cells are filled in with the median of the SAT verbal scores. | ✕ | ✕ | ✕ | ✕
SAT Math Score      | Continuous variable, out of 800. Empty cells are filled in with the median of the SAT math scores.   | ✕ | ✕ | ✕ | ✕
Average Percentile  | Continuous variable, between 0 and 1. Average of the HS GPA and SAT Total percentiles for each student. | ✕ | ✕ | ✕ | ✕
Class Term Taken    | Continuous variable; indicates the term in which the student took the course used for prediction and the course being predicted. | ✕ | ✕ | ✕ | ✕
Term GPA            | Continuous variable; the non-cumulative GPA for the term in which the student took the course used for prediction and the course being predicted. | ✕ | ✕ | ✕ | ✕
Instructor Gender   | Binary variable, split between male and female.                                                      |   | ✕ | ✕ | ✕
Grade Points        | Continuous variable; the grade received in the course used for predicting the second course.         | ✕ | ✕ | ✕ | ✕
Overall Evaluations | Continuous or binary, depending on the treatment of the specific test (flagging or percentiles). Defined as SET questions 15 and 16 (see Appendix B). |   | ✕ | ✕ |
All Evaluations     | Continuous or binary, depending on the treatment of the specific test (flagging or percentiles). Defined as SET questions 1 through 14 (see Appendix B). |   |   | ✕ |
Course ID           | Binary; represents the unique course taken by a student. The ID is the discipline, course number, section number, term taken, and a binary digit indicating a summer term. Students in the same course and section will all have a 1. |   |   |   | ✕
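As an illustration of how the baseline features in Table 9 might be assembled (median imputation for missing SAT scores, averaged percentiles, explicit "not listed" categories), here is a hedged sketch; the DataFrame schema and its column names are hypothetical, not the study's actual data layout.

    import pandas as pd

    def build_baseline_features(students: pd.DataFrame) -> pd.DataFrame:
        df = students.copy()
        # Empty SAT cells are filled with the median score, as described in Table 9.
        for col in ["sat_total", "sat_verbal", "sat_math"]:
            df[col] = df[col].fillna(df[col].median())
        # HS GPA and SAT Total percentiles, averaged into a single 0-1 feature.
        df["avg_percentile"] = df[["hs_gpa", "sat_total"]].rank(pct=True).mean(axis=1)
        # Categorical variables keep an explicit "not listed" category rather than
        # dropping missing values, then become one-hot columns.
        df[["race", "sex"]] = df[["race", "sex"]].fillna("not listed")
        return pd.get_dummies(df, columns=["race", "sex"])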
B GMU'S STUDENT EVALUATION OF TEACHING (SET)

Each of these items was rated on a scale of 1 to 5, with an N/A option available. Questions 15 and 16 are the "overall" evaluations used in certain experiments.

(1) Course requirements and expectations were clear.
(2) The course was well organized.
(3) The instructor helped me to better understand the course material.
(4) Feedback (written comments and suggestions on papers, solutions provided, class discussion, etc.) was helpful.
(5) The instructor showed respect for the students.
(6) The instructor was accessible either in person or electronically.
(7) The course grading policy was clear.
(8) Graded work reflected what was covered in the course.
(9) The assignments (projects, papers, presentations, etc.) helped me learn the material.
(10) The textbook and/or assigned readings helped me understand the material.
(11) Assignments and exams were returned in a reasonable amount of time.
(12) The instructor covered the important aspects of the course as outlined in the syllabus.
(13) The instructor made the class intellectually stimulating.
(14) The instructor encouraged the students to be actively involved in the material through discussion, assignments, and other activities.
(15) My overall rating of the teaching.
(16) My overall rating of this course.
C EXTENDED GRADE PREDICTION RESULTS

Table 10: Highest performing models from each of the experiments, evaluation treatments, and grade prediction styles. The best performers in each grade prediction style block are highlighted.

Type      | Experiment | Model             | Percentiles F1 | Percentiles AUC | Percentiles Acc | Top 10% Flags F1 | Top 10% Flags AUC | Top 10% Flags Acc
Passing   | 1 overall  | Gradient Boosting | 0.665 ±0.046   | 0.872 ±0.034    | 0.809 ±0.029    | 0.689 ±0.053     | 0.875 ±0.032      | 0.826 ±0.032
Passing   | 2 overall  | Gradient Boosting | 0.686 ±0.039   | 0.880 ±0.032    | 0.822 ±0.022    | 0.684 ±0.053     | 0.874 ±0.033      | 0.821 ±0.030
Passing   | 2 full     | Gradient Boosting | 0.670 ±0.056   | 0.876 ±0.034    | 0.812 ±0.027    | 0.672 ±0.054     | 0.875 ±0.032      | 0.815 ±0.029
Potential | 1 overall  | Random Forest     | 0.799 ±0.010   | 0.770 ±0.018    | 0.728 ±0.013    | 0.803 ±0.006     | 0.773 ±0.017      | 0.734 ±0.008
Potential | 2 overall  | Random Forest     | 0.808 ±0.010   | 0.787 ±0.018    | 0.741 ±0.013    | 0.796 ±0.010     | 0.773 ±0.017      | 0.725 ±0.014
Potential | 2 full     | Random Forest     | 0.804 ±0.016   | 0.788 ±0.030    | 0.739 ±0.019    | 0.796 ±0.008     | 0.771 ±0.013      | 0.722 ±0.010

The experiment that most improved upon the baseline power of prediction utilizes unique course IDs to represent individual courses taken. The results of this type of experiment are displayed in Table 11, which contains the results of the ID experiments predicting student grades. The top performing models in Table 11 outperform the baseline predictive powers in F1, AUC, and accuracy measures, and the significance of these experiments is explored in Table 4.

Table 11: Predicting passing CS211 from unique IDs for both courses and using either continuous or discrete

Classifier        | Passing F1    | Passing AUC   | Passing Acc   | Potential F1  | Potential AUC | Potential Acc
Gradient Boosting | 0.677 ±0.039  | 0.876 ±0.029  | 0.821 ±0.015  | 0.811 ±0.013  | 0.801 ±0.015  | 0.748 ±0.017
AdaBoost          | 0.666 ±0.039  | 0.858 ±0.024  | 0.810 ±0.018  | 0.794 ±0.015  | 0.788 ±0.023  | 0.736 ±0.013
Neural Net        | 0.664 ±0.044  | 0.850 ±0.031  | 0.812 ±0.016  | 0.763 ±0.025  | 0.761 ±0.031  | 0.697 ±0.033
Random Forest     | 0.639 ±0.059  | 0.878 ±0.022  | 0.814 ±0.023  | 0.811 ±0.011  | 0.804 ±0.014  | 0.745 ±0.014
SVC               | 0.633 ±0.060  | 0.849 ±0.029  | 0.815 ±0.021  | 0.772 ±0.016  | 0.749 ±0.022  | 0.697 ±0.017
Decision Tree     | 0.605 ±0.068  | 0.802 ±0.047  | 0.800 ±0.021  | 0.789 ±0.019  | 0.739 ±0.015  | 0.719 ±0.025
Naive Bayes       | 0.485 ±0.005  | 0.550 ±0.010  | 0.377 ±0.014  | 0.284 ±0.051  | 0.673 ±0.032  | 0.466 ±0.021